Abstract

We study prediction in the functional linear model with functional outputs: $Y = SX + \varepsilon$,
where the variables X and Y belong to some functional space and S is a linear operator. We
provide the asymptotic mean square prediction error with exact constants for our estimator 
which is based on functional PCA of the input and has a classical form. As a consequence we 
derive the optimal choice of the dimension $k_n$ of the projection space. The rates we obtain
are optimal in minimax sense and generalize those found when the output is real. Our main 
results hold with no prior assumptions on the rate of decay of the eigenvalues of the input. 
This allows us to consider a wide class of parameters and inputs $X(\cdot)$ that may be either very
irregular or very smooth. We also prove a central limit theorem for the predictor which 
improves results by Cardot, Mas and Sarda (2007) in the simpler model with scalar outputs. 
We show that, due to the underlying inverse problem, the bare estimate cannot converge in 
distribution for the norm of the function space. 

Keywords : Functional data; Linear regression model; Functional output; Prediction mean 
square error; Weak convergence; Optimality. 

1 Introduction 



1.1 The model 

Functional data analysis has become in recent years an important field of statistical research,
with many possible applications in various domains (climatology, remote sensing, linguistics,
economics, ...). When one is interested in a phenomenon continuously indexed by
time, for instance, it seems appropriate to consider this phenomenon as a whole curve. Practical
aspects also go in this direction, since current technologies allow data to be collected on fine
discretized grids. The papers by Ramsay and Dalzell (1991) and Frank and Friedman (1993) began to pave
the way in favour of this idea of taking into account the functional nature of these data, and 
highlighted the drawbacks of considering a multivariate point of view. Major references in this 
domain are the monographs by Ramsay and Silverman (2002, 2005) which give an overview 
about the philosophy and the basic models involving functional data. Important nonparametric 
issues are treated in the monograph by Ferraty and Vieu (2006). 

A particular problem in statistics is to predict the value of a variable of interest Y knowing a
covariate X. An underlying model can then be written:

$$Y = r(X) + \varepsilon,$$

where r is an operator representing the link between the variables X and Y and $\varepsilon$ is a noise
random variable. In our functional data context, we want to consider that both variables X and
Y are of functional nature, i.e. are random functions defined on an interval $I = [a, b]$ of
$\mathbb{R}$. We assume that X and Y take values in the space $L^2(I)$ of square integrable functions on $I$. In the
following and in order to simplify, we assume that $I = [0, 1]$, which is not restrictive since the
simple transformation $x \mapsto (x - a)/(b - a)$ brings us back to that case.

We assume as well that X and Y are centered. The issue of estimating the means $E(X)$
and $E(Y)$ in order to center the data has been treated exhaustively in the literature and is of minor
interest in our setting. The objective of this paper is to consider the model with functional input
and output:

$$Y(t) = \int_0^1 S(s,t) X(s)\, ds + \varepsilon(t), \qquad E(\varepsilon | X) = 0, \qquad (1)$$

where $S(\cdot,\cdot)$ is an integrable kernel: $\int\int |S(s,t)|\, ds\, dt < +\infty$. The kernel S may be represented
on a 3D-plot by a surface. The functional historical model (Malfait and Ramsay, 2003) is

$$Y(t) = \int_0^t S_{hist}(s,t) X(s)\, ds + \varepsilon(t),$$

and may be recovered from the first model by setting $S(s,t) = S_{hist}(s,t) 1_{\{s < t\}}$; the surface
defining S is then null when (s, t) is located in the triangle above the first diagonal of the unit square.

Model (1) may be viewed as a random Fredholm equation where both the input and the output
are random (or noisy). This model has already been the subject of some studies, as for instance
Chiou, Müller and Wang (2004) or Yao, Müller and Wang (2005), who propose an estimation
of the functional parameter S using functional PCAs of the curves X and Y. One of the first
studies about this model is due to Cuevas, Febrero and Fraiman (2002), who considered the case
of a fixed design. In this somewhat different context, they study an estimation of the functional
coefficient of the model and give consistency results for this estimator. Recently, Antoch et al.
(2008) proposed a spline estimator of the functional coefficient in the functional linear model
with a functional response, while Aguilera, Ocaña and Valderrama (2008) proposed a wavelet
estimation of this coefficient.

We start with a sample $(Y_i, X_i)_{1 \le i \le n}$ with the same law as (Y, X), and we consider a new
observation $X_{n+1}$. Throughout the paper, our goal will be to predict the value of $Y_{n+1}$.

The model (1) may be revisited if one acknowledges that $\int_0^1 S(s,t) X(s)\, ds$ is the image of X
through a general linear integral operator. Denoting S the operator defined on and with values
in $L^2([0,1])$ by $(Sf)(t) = \int_0^1 S(s,t) f(s)\, ds$ we obtain from (1) that $Y(t) = S(X)(t) + \varepsilon(t)$ or

$$Y = SX + \varepsilon, \quad \text{where } S(X)(t) = \int_0^1 S(s,t) X(s)\, ds.$$

This fact motivates a more general framework: it may be interesting to consider Sobolev spaces
$W^{m,p}$ instead of $L^2([0,1])$ in order to allow some intrinsic smoothness for the data. It turns
out that, amongst this class of spaces, we should favor Hilbert spaces. Indeed the unknown
parameter is a linear operator and the spectral theory of operators acting on Hilbert spaces
allows enough generality, intuitive approaches and easier practical implementation. That is why
in the sequel we consider a sample $(Y_i, X_i)_{1 \le i \le n}$ where the $(Y_i, X_i)$ are independent, identically
distributed and take values in the same Hilbert space H endowed with inner product $\langle\cdot,\cdot\rangle$ and
associated norm $\|\cdot\|$.

Obviously the model we consider generalizes the regression model with a real output y:

$$y = \int_0^1 \beta(s) X(s)\, ds + \varepsilon = \langle \beta, X \rangle + \varepsilon, \qquad (2)$$

and all our results hold in this direction. The literature on (2) is wide, but we picked articles
which are close to our present concerns and will be cited again later in this work: Yao, Müller
and Wang (2005), Hall and Horowitz (2007), Crambes, Kneip and Sarda (2009)...
Since the unknown parameter is here an operator, the infinite-dimensional equivalent of a
matrix, it is worth giving some basic information about operator theory on Hilbert spaces. The
interested reader can find basics and complements about this topic in the following reference
monographs: Akhiezer and Glazman (1981), Dunford and Schwartz (1988), Gohberg, Goldberg
and Kaashoek (1991). We denote by $\mathcal{L}$ the space of bounded (hence continuous) operators on
a Hilbert space H. For our statistical or probabilistic purposes, we restrict this space to the
space of compact operators $\mathcal{L}_c$. Then, any compact and symmetric operator T belonging to $\mathcal{L}_c$
admits a unique Schmidt decomposition of the form $T = \sum_{j \in \mathbb{N}} \mu_j \phi_j \otimes \phi_j$, where the $(\mu_j, \phi_j)$'s
are called the eigenelements of T, and the tensor product notation $\otimes$ is defined in the following
way: for any functions f, g and h belonging to H, we define $f \otimes g = \langle g, \cdot \rangle f$ or

$$[f \otimes g](h)(s) = \left( \int g(t) h(t)\, dt \right) f(s).$$



Finally we mention two subclasses of $\mathcal{L}_c$, one of which will be our parameter space. The spaces
of Hilbert-Schmidt operators and trace class operators are defined respectively by

$$\mathcal{L}_2 = \Big\{ T \in \mathcal{L}_c : \sum_j \mu_j^2 < +\infty \Big\}, \qquad \mathcal{L}_1 = \Big\{ T \in \mathcal{L}_c : \sum_j \mu_j < +\infty \Big\}.$$

It is well-known that if S is the linear operator associated to the kernel $S(s,t)$ as in display (1),
then S is Hilbert-Schmidt if $\int\int |S(s,t)|\, ds\, dt < +\infty$, and S is trace class if $S(s,t)$ is continuous
as a function of (s, t).

1.2 Estimation 

Our purpose here is first to introduce the estimator. This estimate looks basically like the one
studied in Yao, Müller and Wang (2005). Our second goal is to justify the choice of such a
candidate from a more theoretical standpoint.

Two strategies may be carried out to propose an estimate of S. They eventually coincide, as in the
finite-dimensional framework. One could consider the theoretical mean square program (convex
in S)

$$\min_{S \in \mathcal{L}_2} E \|Y - SX\|^2,$$

whose solution $S_*$ is defined by the equation $E[Y \otimes X] = S_* E[X \otimes X]$. On the other hand it is
plain that the moment equation:

$$E[Y \otimes X] = E[S(X) \otimes X] + E[\varepsilon \otimes X]$$

leads to the same solution. Finally, denoting $\Delta = E[Y \otimes X]$ and $\Gamma = E[X \otimes X]$, we get $\Delta = S\Gamma$.
Turning to empirical counterparts with 



$$\Delta_n = \frac{1}{n} \sum_{i=1}^n Y_i \otimes X_i, \qquad \Gamma_n = \frac{1}{n} \sum_{i=1}^n X_i \otimes X_i,$$



the estimate $S_n$ of S should naturally be defined by $\Delta_n = S_n \Gamma_n$. Once again the moment method
and the minimization of the mean square program coincide. Note in passing that $\Delta_n = S\Gamma_n + U_n$
with $U_n = \frac{1}{n} \sum_{i=1}^n \varepsilon_i \otimes X_i$. The trouble is that, from $\Delta_n = S_n \Gamma_n$, we cannot directly derive an
explicit form for $S_n$. Indeed $\Gamma_n$ is not invertible on the whole of H since it has finite rank. The
next section proposes solutions to this inverse problem by classical methods.

As a last point we note that if $S_n$ is an estimate of S, a statistical predictor given a new
input $X_{n+1}$ is:

$$\hat{Y}_{n+1}(t) = S_n(X_{n+1})(t) = \int S_n(s,t) X_{n+1}(s)\, ds. \qquad (3)$$
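
The empirical operators above can be sketched numerically. The following is a minimal illustration, assuming that the curves are observed on a common uniform grid of p points of [0, 1], so that integrals become Riemann sums with weight 1/p. The function name `empirical_operators` and the simulated curves are illustrative assumptions, not part of the paper.

```python
import numpy as np

def empirical_operators(X, Y):
    """Discretized Delta_n and Gamma_n for centered curves sampled on a uniform grid.

    X, Y: arrays of shape (n, p); row i holds X_i (resp. Y_i) at the p grid points.
    Returns p x p matrices D and G such that (D @ h) / p approximates Delta_n(h)
    and (G @ h) / p approximates Gamma_n(h) for a function h sampled on the grid.
    """
    n = X.shape[0]
    # (f tensor g)(h) = <g, h> f, so Y_i tensor X_i has kernel Y_i(t) X_i(s).
    D = Y.T @ X / n   # D[t, s] = (1/n) sum_i Y_i(t) X_i(s)
    G = X.T @ X / n   # G[t, s] = (1/n) sum_i X_i(t) X_i(s)
    return D, G

# Hypothetical data: n = 20 rough (Brownian-like) input paths on p = 100 points.
rng = np.random.default_rng(0)
n, p = 20, 100
X = rng.standard_normal((n, p)).cumsum(axis=1) / np.sqrt(p)
Y = rng.standard_normal((n, p))
D, G = empirical_operators(X, Y)
rank_G = np.linalg.matrix_rank(G)  # at most n, hence far below p
```

Since the rank of the discretized $\Gamma_n$ cannot exceed n, the matrix is singular as soon as p > n, which is the finite-rank obstruction described above.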
1.3 Identifiability, inverse problem and regularization issues

We turn again to the equation which defines the operator S: $\Delta = S\Gamma$. Taking a one-to-one $\Gamma$
is a first and basic requirement for identifiability. It is simple to check that if $\ker\Gamma \neq \{0\}$ and $v \in \ker\Gamma$,
then $\Delta = S\Gamma = (S + v \otimes v)\Gamma$ for instance, and the uniqueness of S is no longer ensured. In other words, the
inference based on the equation $\Delta = S\Gamma$ alone does not ensure the identifiability of the model. From
now on we assume that $\ker\Gamma = \{0\}$. At this point, some more theoretical concerns should be
mentioned. Indeed, writing $S = \Delta\Gamma^{-1}$ is incorrect. The operator $\Gamma^{-1}$ exists whenever $\ker\Gamma = \{0\}$
but is unbounded, that is, not continuous. We refer once again to Dunford and Schwartz (1988)
for instance for developments on unbounded operators. It turns out that $\Gamma^{-1}$ is a linear mapping
defined on a dense domain $\mathcal{D}$ of H which is measurable but continuous at no point of its domain.
Let us denote $(\lambda_j, e_j)$ the eigenelements of $\Gamma$. Elementary facts of functional analysis show that
$S_{|\mathcal{D}} = \Delta\Gamma^{-1}$ where $\mathcal{D}$ is the domain of $\Gamma^{-1}$, i.e. the range of $\Gamma$, and is defined by



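$$\mathcal{D} = \Big\{ x = \sum_j x_j e_j \in H : \sum_j \frac{x_j^2}{\lambda_j^2} < +\infty \Big\}.$$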



A link is possible with probability and Gaussian analysis which may be illustrative. If $\Gamma$ is
the covariance operator of a Gaussian random element X on H (a process, a random function,
etc.) then the Reproducing Kernel Hilbert Space of X coincides with the domain of $\Gamma^{-1/2}$ and
the range of $\Gamma^{1/2}$: $RKHS(X) = \{ x = \sum_j x_j e_j \in H : \sum_j x_j^2 / \lambda_j < +\infty \}$.

The last stumbling block comes from switching population parameters to empirical ones. We
construct our estimate from the equation $\Delta_n = S\Gamma_n + U_n$ as seen above and setting $\Delta_n = S_n\Gamma_n$.
Here the inverse of $\Gamma_n$ does not even exist since this covariance operator has finite rank. If $\Gamma_n$
were invertible we could set $S_n = \Delta_n\Gamma_n^{-1}$, but we have to regularize $\Gamma_n$ first. We use
techniques which are classical in inverse problems theory. Indeed, the spectral decomposition
of $\Gamma_n$ is $\Gamma_n = \sum_j \hat\lambda_j (\hat e_j \otimes \hat e_j)$ where the $(\hat\lambda_j, \hat e_j)$ are the empirical eigenelements of $\Gamma_n$ (the $\hat\lambda_j$'s are
sorted in decreasing order and some of them may be null) derived from the functional PCA.
The spectral cut regularized inverse is given for some integer k by

$$\Gamma_n^\dagger = \sum_{j=1}^k \hat\lambda_j^{-1} (\hat e_j \otimes \hat e_j). \qquad (4)$$

The choice of $k = k_n$ is crucial; none of the $(\hat\lambda_j)_{1 \le j \le k}$ can be null, and one should stress
that $\hat\lambda_j^{-1} \uparrow +\infty$ when j increases. The reader will note that we could define equivalently
$\Gamma^\dagger = \sum_{j=1}^k \lambda_j^{-1} (e_j \otimes e_j)$. From the definition of the regularized inverse above, we can derive a
useful equation. Indeed, let $\hat\Pi_k$ denote the projection on the k first eigenvectors of $\Gamma_n$, that is
the projection on $\mathrm{span}(\hat e_1, \dots, \hat e_k)$. Then $\Gamma_n^\dagger\Gamma_n = \Gamma_n\Gamma_n^\dagger = \hat\Pi_k$. For further purposes we define as
well $\Pi_k$ to be the projection operator on (the space spanned by) the k first eigenvectors of $\Gamma$.
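
The spectral cut inverse and the projection identity can be checked on a finite-dimensional analogue. The sketch below assumes a symmetric positive semi-definite matrix in place of $\Gamma_n$ and the Euclidean inner product in place of the one of H; the helper name `spectral_cut_inverse` is hypothetical.

```python
import numpy as np

def spectral_cut_inverse(G, k):
    """Return the rank-k spectral-cut pseudo-inverse of a symmetric PSD matrix G
    together with the projector onto its k leading eigenvectors."""
    eigvals, eigvecs = np.linalg.eigh(G)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]         # indices of the k largest eigenvalues
    lam, E = eigvals[order], eigvecs[:, order]
    if np.any(lam <= 0):
        raise ValueError("the k leading eigenvalues must be strictly positive")
    G_dagger = (E / lam) @ E.T                    # sum_j lam_j^{-1} e_j e_j^T
    P_k = E @ E.T                                 # projector on span(e_1, ..., e_k)
    return G_dagger, P_k

# Hypothetical example: an 8 x 8 empirical covariance matrix, cut at k = 3.
rng = np.random.default_rng(1)
X = rng.standard_normal((30, 8))
G = X.T @ X / 30
G_dagger, P_k = spectral_cut_inverse(G, k=3)
```

In this example `G_dagger @ G` and `G @ G_dagger` both coincide with `P_k` up to rounding, which is the matrix counterpart of the identity between the regularized inverse and the projection stated above.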

Remark 1 The regularization method we propose is the most intuitive to us but may be changed
by considering: $\Gamma_{n,f}^\dagger = \sum_j f_n(\hat\lambda_j) (\hat e_j \otimes \hat e_j)$ where $f_n$ is a smooth function which converges
pointwise to $x \mapsto 1/x$. For instance, we could choose $f_n(\hat\lambda_j) = (\alpha_n + \hat\lambda_j)^{-1}$ where
$\alpha_n > 0$ and $\alpha_n \downarrow 0$, and $\Gamma_{n,f}^\dagger$ would be the penalized-regularized inverse of $\Gamma_n$. Taking
$f_n(\hat\lambda_j) = \hat\lambda_j (\alpha_n + \hat\lambda_j^2)^{-1}$ leads to a Tikhonov regularization. We refer to the remarks within section 3 of
Cardot, Mas, Sarda (2007) to check that additional assumptions on $f_n$ (controlling the rate of
convergence of $f_n$ to $x \mapsto 1/x$) allow us to generalize the overall approach of this work to the class
of estimates $\Gamma_{n,f}^\dagger$.
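
The filter functions mentioned in Remark 1 can be compared on a given spectrum. The sketch below is only illustrative: the eigenvalue sequence $\hat\lambda_j = j^{-2}$ and the values of the penalization parameter are arbitrary assumptions.

```python
import numpy as np

def spectral_cut(lam, k):
    """1/lam on the k leading eigenvalues, 0 elsewhere (lam sorted decreasingly)."""
    out = np.zeros_like(lam, dtype=float)
    out[:k] = 1.0 / lam[:k]
    return out

def ridge(lam, alpha):
    """Penalized filter (alpha + lam)^{-1}."""
    return 1.0 / (alpha + lam)

def tikhonov(lam, alpha):
    """Tikhonov filter lam / (alpha + lam^2)."""
    return lam / (alpha + lam ** 2)

lam = 1.0 / np.arange(1, 11) ** 2       # hypothetical eigenvalues lam_j = j^{-2}
for alpha in (1e-2, 1e-4, 1e-6):
    # Distance between f_n(lam) * lam and 1: small when f_n is close to x -> 1/x.
    err_ridge = np.max(np.abs(ridge(lam, alpha) * lam - 1.0))
    err_tikh = np.max(np.abs(tikhonov(lam, alpha) * lam - 1.0))
    print(alpha, err_ridge, err_tikh)
```

Both penalized filters stay bounded for a fixed penalization parameter (the Tikhonov one never exceeds $1/(2\sqrt{\alpha_n})$), whereas the spectral cut filter grows like $1/\hat\lambda_k$ as k increases.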
To conclude this subsection, we refer the reader interested in the topic of inverse problem
solving to the following books : Tikhonov and Arsenin (1977), Groetsch (1993), Engl, Hanke 
and Neubauer (2000). 

1.4 Assumptions 

The assumptions we need are classically of three types: regularity of the regression parameter
S, moment assumptions on X and regularity assumptions on X which are often expressed in
terms of spectral properties of $\Gamma$ (especially the rate of decrease to zero of its eigenvalues).

Assumption on S

As announced earlier, we assume that S is Hilbert-Schmidt, which may be rewritten: for
any orthonormal basis $(\phi_j)_{j \in \mathbb{N}}$ of H

$$\sum_{j,\ell} \langle S(\phi_\ell), \phi_j \rangle^2 < +\infty. \qquad (5)$$

This assumption echoes the assumption $\sum_j \beta_j^2 < +\infty$ in the functional linear model (2)
with real outputs. We already underlined that (5) is equivalent to assuming that S is doubly
integrable if H is $L^2([0, 1])$. Finally no continuity or smoothness is required for the kernel S at
this point.

Moment assumptions on X

In order to better understand the moment assumptions on X, we recall the Karhunen-Loève
expansion, which is nothing but the decomposition of X in the basis of the eigenvectors of
$\Gamma$: $X = \sum_{j=1}^{+\infty} \sqrt{\lambda_j}\, \xi_j e_j$ a.s., where the $\xi_j$'s are independent centered real random variables
with unit variance. We need higher moment assumptions because we need to apply Bernstein's
exponential inequality to functionals of $\Gamma - \Gamma_n$. We assume that there exists a
constant b such that, for all $j, \ell \in \mathbb{N}$,

$$E(|\xi_j|^\ell) \le \frac{\ell!}{2} b^{\ell - 2} \cdot E(|\xi_j|^2), \qquad (6)$$

which echoes the assumption (2.19) p. 49 in Bosq (2000). As a consequence, we see that

$$E\langle X, e_j \rangle^4 \le C \left( E\langle X, e_j \rangle^2 \right)^2. \qquad (7)$$

This requirement already appears in several papers. It states that the sequence of the fourth
moments of the margins of X tends to 0 quickly enough. The assumptions above always hold for
a Gaussian X. These assumptions are close to the moment assumptions usually required when
rates of convergence are addressed.

Assumptions on the spectrum of $\Gamma$

The covariance operator $\Gamma$ is assumed to be injective, hence with strictly positive eigenvalues
arranged in decreasing order. Let the function $\lambda : \mathbb{R}^+ \to \mathbb{R}^{+*}$ be defined by $\lambda(j) = \lambda_j$ for any
$j \in \mathbb{N}$ (the $\lambda_j$'s are continuously interpolated between j and j + 1). From the assumption above
we already know that $\sum_j \lambda_j < +\infty$. Indeed the summability of the eigenvalues of $\Gamma$ is ensured
whenever $E\|X\|^2 < +\infty$. Besides, assume that for x large enough

$$x \mapsto \lambda(x) \text{ is convex.} \qquad (8)$$

These last conditions are mild and match a very large class of eigenvalues: with arithmetic
decay $\lambda_j = Cj^{-1-\alpha}$ where $\alpha > 0$ (like in Hall and Horowitz (2007)), with exponential decay
$\lambda_j = Cj^{-\beta}\exp(-\alpha j)$, Laurent series $\lambda_j = Cj^{-1-\alpha}(\log j)^{-\beta}$ or even $\lambda_j = Cj^{-1}(\log j)^{-1-\alpha}$. Such
a rate of decay occurs for extremely irregular processes, even more irregular than the Brownian
motion for which $\lambda_j = Cj^{-2}$. In fact our framework initially relaxes prior assumptions on the
rate of decay of the eigenvalues, hence on the regularity of X. It will be seen later that exact risk
and optimality are obtained when considering specific classes of eigenvalues. Assumption (8) is
crucial however since the most general Lemmas rely on convex inequalities for the eigenvalues.
2 Asymptotic results



We are now in a position to introduce our estimate.

Definition 2 The estimate $S_n$ of S is defined by: $S_n = \Delta_n\Gamma_n^\dagger$; the associated predictor is
$\hat Y_{n+1} = S_n(X_{n+1}) = \Delta_n\Gamma_n^\dagger(X_{n+1})$. It is possible to provide a kernel form. We deduce from
$S_n = \Delta_n\Gamma_n^\dagger$ that

$$S_n(s,t) = \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^k \frac{\langle X_i, \hat e_j \rangle}{\hat\lambda_j}\, Y_i(t)\, \hat e_j(s).$$

Though distinct, this estimate remains close to the one proposed in Yao, Müller and
Wang (2005), the difference being that we do not consider a Karhunen-Loève
expansion of Y. In the sequel, our main results are usually given in terms of $S_n$ but we
frequently switch to the 'kernel' viewpoint since it may sometimes be more illustrative. Then
we implicitly assume that $H = L^2([0, 1])$.
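
The kernel form of Definition 2 translates directly into a few lines of linear algebra. The following is a minimal sketch, assuming centered curves observed on a uniform grid of [0, 1] and integrals replaced by Riemann sums; the function name `fit_predict`, the sine basis and the kernel used in the example are hypothetical choices made for illustration only.

```python
import numpy as np

def fit_predict(X, Y, X_new, k):
    """Spectral-cut estimate S_n = Delta_n Gamma_n^dagger applied to a new curve.

    X, Y: (n, p) arrays of centered curves on a uniform grid of [0, 1];
    X_new: (p,) array; k: number of principal components kept.
    Integrals are approximated by Riemann sums with weight 1/p.
    """
    n, p = X.shape
    G = X.T @ X / n                                # kernel of Gamma_n on the grid
    eigvals, eigvecs = np.linalg.eigh(G / p)       # eigenpairs of the integral operator
    order = np.argsort(eigvals)[::-1][:k]
    lam_hat = eigvals[order]                       # empirical eigenvalues
    e_hat = eigvecs[:, order] * np.sqrt(p)         # eigenfunctions with unit L2 norm
    scores = X @ e_hat / p                         # <X_i, e_hat_j>, shape (n, k)
    new_scores = X_new @ e_hat / p                 # <X_new, e_hat_j>, shape (k,)
    # Y_hat(t) = (1/n) sum_i sum_j <X_i, e_j> <X_new, e_j> / lam_j * Y_i(t)
    weights = scores @ (new_scores / lam_hat) / n  # shape (n,)
    return weights @ Y

# Noise-free toy example: inputs living in a 3-dimensional space of sine functions.
rng = np.random.default_rng(0)
n, p, k = 50, 200, 3
t = (np.arange(p) + 0.5) / p
basis = np.sqrt(2) * np.sin(np.outer(np.arange(1, k + 1), np.pi * t))  # shape (k, p)
X = (rng.standard_normal((n, k)) * [1.0, 0.5, 0.25]) @ basis
K = np.exp(-np.abs(t[:, None] - t[None, :]))      # hypothetical kernel S(s, t)
Y = X @ K / p                                     # Y_i(t) = int S(s, t) X_i(s) ds
X_new = np.array([0.3, -0.2, 0.1]) @ basis
Y_hat = fit_predict(X, Y, X_new, k)
```

In this noise-free setting the new input lies in the span of the k retained eigenvectors, so the predictor reproduces $S(X_{n+1})$ up to rounding. With noisy responses the choice of k drives the prediction error, as quantified in the next subsection.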

We insist on our philosophy. Estimating S is not our primary concern. We focus on the
predictor at a random design point $X_{n+1}$, independent of the initial sample. The issue of
estimating S itself may arise typically for testing. As shown later in this work and as mentioned
in Crambes, Kneip and Sarda (2009), considering the prediction mean square error finally comes
down to studying the mean square error of S for a smooth, intrinsic norm, depending on $\Gamma$. From
now on, all our results are stated when the assumptions of subsection 1.4 hold.



2.1 Mean square prediction error and optimality 

We start with an upper bound from which we deduce, as a Corollary, the exact asymptotic risk
of the predictor. What is considered here is the predictor $\hat Y_{n+1}$ based on $S_n$ and $X_{n+1}$. It is
compared with $E(Y_{n+1} | X_{n+1}) = S(X_{n+1})$. Let $\Gamma_\varepsilon = E(\varepsilon \otimes \varepsilon)$ be the covariance operator of the
noise and denote $\sigma_\varepsilon^2 = \mathrm{tr}\,\Gamma_\varepsilon$.

Theorem 3 The mean square prediction error of our estimate has the following exact asymptotic
expansion:

$$E\|S_n(X_{n+1}) - S(X_{n+1})\|^2 = \sigma_\varepsilon^2 \frac{k}{n} + \sum_{j=k+1}^{+\infty} \lambda_j \|S(e_j)\|^2 + A_n + B_n, \qquad (9)$$

where $A_n \le C_A \|S\|_{\mathcal{L}_2} k^2 \lambda_k / n$ and $B_n \le C_B k^2 \log k / n^2$, where $C_A$ and $C_B$ are constants which
do not depend on k, n or S.

The first two terms determine the convergence rate: the variance effect appears through
$\sigma_\varepsilon^2 k/n$ and the bias (related to smoothness) through $\sum_{j=k+1}^{+\infty} \lambda_j \|S(e_j)\|^2$. Several comments
are needed at this point. The term $A_n$ comes from the bias decomposition and $B_n$ is a residue
from the variance. Both are negligible with respect to the first two terms. Indeed, $k\lambda_k \to 0$ since
$\sum_k \lambda_k < +\infty$, and hence $A_n = o(k/n)$. Turning to $B_n$ is a little bit more tricky. It can be seen
from the lines just above the forthcoming Proposition 8 that necessarily $(k\log k)^2/n \to 0$, which
ensures that $B_n = o(k/n)$. A second interesting property arises from Theorem 3. Rewriting
$\lambda_j \|S(e_j)\|^2 = \|S\Gamma^{1/2}(e_j)\|^2$ we see that the only regularity assumptions needed may be made
from the spectral decomposition of the operator $S\Gamma^{1/2}$ itself and not from X (or $\Gamma$ as well) and
S separately.
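
The trade-off between the two leading terms of (9) can be visualized numerically. The sketch below evaluates the variance term plus the bias tail for a hypothetical sequence $b_j = \lambda_j \|S(e_j)\|^2 = j^{-3}$ (an assumption made only for illustration) and returns the dimension k minimizing their sum; the remainder terms $A_n$ and $B_n$ are ignored.

```python
import numpy as np

def leading_risk(k_values, n, sigma2, b):
    """sigma2 * k / n + sum_{j > k} b_j, where b_j stands for lam_j * ||S e_j||^2.

    b is a finite array (b_1, ..., b_J) used as a truncation of the infinite sum.
    """
    # tail[k] = sum_{j > k} b_j for k = 0, ..., J
    tail = np.concatenate([np.cumsum(b[::-1])[::-1], [0.0]])
    k_values = np.asarray(k_values)
    return sigma2 * k_values / n + tail[k_values]

J = 5000
b = 1.0 / np.arange(1, J + 1) ** 3.0     # hypothetical decay b_j = j^{-3}
ks = np.arange(1, 200)
for n in (100, 1000, 10000):
    risk = leading_risk(ks, n, sigma2=1.0, b=b)
    print(n, ks[np.argmin(risk)], risk.min())
```

Adding one dimension costs $\sigma_\varepsilon^2/n$ in variance and saves $b_k$ in bias, so in this example the minimizing k is the largest integer with $k^3 < n$.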

Before turning to optimality we introduce the class of parameters S over which optimality 
will be obtained. 
Definition 4 Let $\varphi : \mathbb{R}^+ \to \mathbb{R}^+$ be a $C^1$ decreasing function such that $\sum_{j=1}^{+\infty} \varphi(j) = 1$ and let
$\mathcal{L}_2(\varphi, L)$ be the class of linear operators from H to H defined by

$$\mathcal{L}_2(\varphi, L) = \left\{ T \in \mathcal{L}_2,\ \|T\|_{\mathcal{L}_2} \le L : \|T(e_j)\| \le L\sqrt{\varphi(j)} \right\}.$$

The set $\mathcal{L}_2(\varphi, L)$ is entirely determined by the bounding constant L and the function $\varphi$.
Hall and Horowitz (2007) consider the case when $\varphi(j) = Cj^{-(\alpha + 2\beta)}$ where $\alpha > 1$ and $\beta > 1/2$.
As mentioned earlier we are free here to take any $\varphi$ such that $\int^{+\infty} \varphi(s)\, ds < +\infty$ and which
leaves assumption (8) unchanged.

As an easy consequence, we derive the uniform bound with exact constants below. 

Theorem 5 Set $L = \|S\Gamma^{1/2}\|_{\mathcal{L}_2}$, $\varphi(j) = \lambda_j \|S(e_j)\|^2 / L^2$ and $k_n^*$ as the integer part of the
unique solution of the integral equation (in x):

$$\frac{1}{x} \int_x^{+\infty} \varphi(s)\, ds = \frac{1}{n} \frac{\sigma_\varepsilon^2}{L^2}. \qquad (10)$$

Let $\mathcal{R}_n(\varphi, L)$ be the uniform prediction risk of the estimate $S_n$ over the class $\mathcal{L}_2(\varphi, L)$:

$$\mathcal{R}_n(\varphi, L) = \sup_{\mathcal{L}_2(\varphi, L)} E\|S_n(X_{n+1}) - S(X_{n+1})\|^2,$$

then

$$\limsup_{n \to +\infty} \frac{n}{k_n^*} \mathcal{R}_n(\varphi, L) = 2\sigma_\varepsilon^2.$$

Display (10) has a unique solution because the function of x on the left-hand side is strictly
decreasing. The integer $k_n^*$ is the optimal dimension: the parameter which minimizes the
prediction risk. It plays the same role as the optimal bandwidth in nonparametric regression.
The upper bound in the display above is obvious from (9). This upper bound is attained when
taking for S the diagonal operator defined in the basis of eigenvectors by $Se_j = L\varphi^{1/2}(j)\lambda_j^{-1/2}e_j$.
The proof of this Theorem is an easy consequence of Theorem 3, hence omitted.
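
Since the left-hand side of the integral equation defining the optimal dimension is strictly decreasing in x, the equation can be solved by bisection. The sketch below is a minimal illustration under stated assumptions: the function name `optimal_dimension` is hypothetical, the tail integral of $\varphi$ is supplied by the user, and the example uses $\varphi(s) = c\,s^{-(2+a)}$ with arbitrary constants, without enforcing the normalization of Definition 4.

```python
import math

def optimal_dimension(tail_integral, n, sigma2, L, lo=1.0, hi=1e9):
    """Integer part of the root x of tail_integral(x) / x = sigma2 / (n * L**2).

    tail_integral(x) must return the integral of phi over [x, +inf).
    The left-hand side is strictly decreasing in x, so bisection applies.
    """
    target = sigma2 / (n * L ** 2)
    g = lambda x: tail_integral(x) / x - target
    if g(lo) < 0 or g(hi) > 0:
        raise ValueError("root not bracketed by [lo, hi]")
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid      # root is to the right of mid
        else:
            hi = mid      # root is to the left of mid
    return math.floor(lo)

# Hypothetical choice phi(s) = c * s**(-(2 + a)), whose tail integral is explicit.
a, c = 1.0, 1.0
tail = lambda x: c * x ** (-(1.0 + a)) / (1.0 + a)
k_star = optimal_dimension(tail, n=10_000, sigma2=1.0, L=1.0)
```

For this power-law choice the root is available in closed form, $x = (c\,n L^2 / ((1+a)\sigma_\varepsilon^2))^{1/(2+a)}$, which gives a direct check of the bisection.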

The next Corollary is an attempt to illustrate the consequences of the previous Theorem
by taking explicit sequences $(\varphi(j))_{j \in \mathbb{N}}$. We chose to treat the case of general Laurent series
(including very irregular input and parameter when $\alpha = 0$) and the case of exponential decay.

Corollary 6 Set $\varphi_a(j) = C_{\alpha,\beta} \left( j^{2+\alpha} (\log j)^\beta \right)^{-1}$ and $\varphi_b(j) = C'_\alpha \exp(-\alpha j)$ where either $\alpha > 0$
and $\beta \in \mathbb{R}$ or $\alpha = 0$ and $\beta > 1$, $C_{\alpha,\beta}$ and $C'_\alpha$ are normalizing constants, then

$$\mathcal{R}_n(\varphi_a, L) \sim \frac{[\ldots]}{n^{(1+\alpha)/(2+\alpha)}}, \qquad \mathcal{R}_n(\varphi_b, L) \le \frac{\log n}{\alpha n}.$$

In the second display we could not compute an exact bound because equation (10) has no
explicit solution. But the term $(\log n)/(\alpha n)$ is obviously sharp since it is parametric up to $\log n$. The
special case $\beta = 0$ and $\alpha > 1$ matches the optimal rate derived in Hall and Horowitz (2007) with
a slight loss due to the fact that the model shows more complexity (S is a function of two
variables whereas $\beta$, the slope parameter in the latter article and in model (2), was a function of a
single variable). We also refer the reader to Stone (1982) who underlines this effect of dimension
on the convergence rates in order to check that our result matches the ones announced by Stone.

In our setting the data Y are infinite dimensional. Obtaining a lower bound for optimality
in the minimax sense is slightly different from the case studied in Hall and Horowitz (2007) and
Crambes, Kneip and Sarda (2009). In order to get a lower bound, our method is close to the
one carried out by Cardot and Johannes (2010), based on a variant of Assouad's Lemma. We
consider Gaussian observations under $2^{k_n}$ distinct models.
Theorem 7 The following bound on the minimax asymptotic risk up to constants proves that
our estimator is optimal in the minimax sense:



inf sup E 



S n O^rM-l) — S (X n +l) 



n 



It appears that another upper bound may be derived from (9). We can avoid introducing
the class $\mathcal{L}_2(\varphi, L)$. From the summability of $\sum_j \lambda_j$ and from $\sum_j \|S(e_j)\|^2 = \|S\|^2_{\mathcal{L}_2}$ we see that the sequences $\lambda_j$
and $\|S(e_j)\|^2$ may both be bounded by $j^{-1}(\log j)^{-1}$, hence that $\lambda_j \|S(e_j)\|^2 \le j^{-2}(\log j)^{-2}$. A
classical sum-integral comparison then yields $\sum_{j \ge k+1} \lambda_j \|S(e_j)\|^2 \le Ck^{-1}(\log k)^{-2}$. We obtain
in the Proposition below a new bound for which no regularity assumption is needed for S.



Proposition 8 The following bound shows uniformity with respect to all Hilbert-Schmidt operators
S (hence any integrable kernel S) and all functional data matching the moment assumptions
mentioned above:

$$\sup_{\|S\|_{\mathcal{L}_2} \le L} E\|S_n(X_{n+1}) - S(X_{n+1})\|^2 \le \sigma_\varepsilon^2 \frac{k}{n} + C \frac{L^2}{k \log^2 k},$$

where C is a universal constant. We deduce the uniform bound with no regularity assumption
on the data or on S:

$$\limsup_{n \to +\infty} \sqrt{n} \log n \sup_{\|S\|_{\mathcal{L}_2} \le L} E\|S_n(X_{n+1}) - S(X_{n+1})\|^2 \le \sigma_\varepsilon^2 + CL^2.$$



The bound above is rough. The constant C does not really matter. The fundamental idea
of the Proposition is to provide an upper bound for the rate uniformly on balls of $\mathcal{L}_2$ without
regularity restrictions: if $a_n$ is the rate of prediction error in square norm considered above,
then necessarily $a_n \le n^{-1/2}(\log n)^{-1}$ (in fact we even have $a_n = o(n^{-1/2}(\log n)^{-1})$) whatever
the unknown parameter S.


Remark 9 The bound above holds with highly irregular data (for instance when $\lambda_j \asymp Cj^{-1}(\log j)^{-1-\alpha}$
with $\alpha > 0$) or with very regular data featuring a flat spectrum with $\lambda_j \asymp Cj^{-1}\exp(-\alpha j)$ or even
the intermediate situation like $\lambda_j \asymp Cj^{-1-\beta}(\log j)^{1+\alpha}$. The literature on linear regression with
functional data usually addressed such issues in restricted cases with prior knowledge of the
eigenvalues like $\lambda_j \asymp Cj^{-1-\beta}$. The same remarks are valid when turning to the regularity of the
kernel S or of the operator S expressed through the sequence $\|S(e_j)\|^2$. Obviously in the case
of rapid decay (say at an exponential rate $\lambda_j \asymp C\exp(-\alpha j)$) one may argue that multivariate
methods would fit the data with much accuracy. We answer that, conversely, in such a situation
(fitting a linear regression model) the usual mean square methods turn out to be extremely
unstable due to ill-conditioning. Our method of proof shows that smooth, regular processes (with
rapid decay of $\lambda_j$) have good approximation properties but an ill-conditioned $\Gamma_n^\dagger$ (i.e. with rapidly
increasing norm), damaging the rate of convergence of $S_n$ which depends on it. But we readily see
that irregular processes (with slowly decreasing $\lambda_j$), despite their poor approximation properties,
lead to a slowly increasing $\Gamma_n^\dagger$ and to solving an easier inverse problem.



Remark 10 At this point it is worth giving a general comment on the rate of increase of the
sequence $k_n$. From the few lines above Proposition 8 we always have $(k\log k)^2/n \to 0$ whatever
the parameter S in the space of Hilbert-Schmidt operators. This property will be useful for
asymptotics and the mathematical derivations given in the last section.
2.2 Weak convergence

The next and last result deals with weak convergence. We start with a negative result which 
shows that due to the underlying inverse problem, the issue of weak convergence cannot be 
addressed under too strong topologies. 

Theorem 11 It is impossible for $S_n$ to converge in distribution for the Hilbert-Schmidt norm.

Once again turning to the predictor, hence smoothing the estimated operator, will produce
a positive result. We improve the results by Cardot, Mas and Sarda (2007) in two ways: first
the model is more general and second we remove the bias term. Weak convergence (convergence
in distribution) is denoted $\xrightarrow{w}$. The reader should pay attention to the fact that the following
Theorem holds in a space of functions (here H). Within this theorem, two results are proved.
The first establishes weak convergence for the predictor with a bias term. The second removes this
bias at the expense of a more specific assumption on the sequence $k_n$.

Theorem 12 If the condition $(k\log k)^2/n \to 0$ holds, then

$$\sqrt{\frac{n}{k}} \left[ S_n(X_{n+1}) - S\Pi_k(X_{n+1}) \right] \xrightarrow{w} G_\varepsilon,$$

where $G_\varepsilon$ is a centered Gaussian random element with values in H and covariance operator $\Gamma_\varepsilon$.
Besides, denoting $\gamma_k = \sup_{j \ge k} \{ j \log j\, \|S(e_j)\| \sqrt{\lambda_j} \}$ (it is plain that $\gamma_k \to 0$) and choosing k
such that $n \le (k\log k)^2/\gamma_k$ (which means that $(k\log k)^2/n$ should not decay too quickly to zero),
the bias term can be removed and we obtain

$$\sqrt{\frac{n}{k}} \left[ S_n(X_{n+1}) - S(X_{n+1}) \right] \xrightarrow{w} G_\varepsilon.$$

Remark 13 We pointed out above the improvement in estimating the rate of decrease of the
bias. The proof of the Theorem comes down to proving weak convergence of a series with values
in the space H. More precisely, an array $\sum_{i=1}^n z_{i,n}\varepsilon_i$ appears where the $z_{i,n}$ are real valued random
variables with increasing variances (when $n \to +\infty$) which are not independent but turn out to
be martingale differences.

From Theorem 12 we deduce general confidence sets for the predictor: let $\mathcal{K}$ be a continuity
set for the measure induced by $G_\varepsilon$, that is $P(G_\varepsilon \in \partial\mathcal{K}) = 0$ where $\partial\mathcal{K} = \overline{\mathcal{K}} \setminus \mathrm{int}(\mathcal{K})$ is the boundary of
$\mathcal{K}$; then $P\left( S_n(X_{n+1}) \in S(X_{n+1}) + \sqrt{k/n}\, \mathcal{K} \right) \to P(G_\varepsilon \in \mathcal{K})$ when $n \to +\infty$. As an application, we
propose the two following corollaries of Theorem 12. The notation $Y^*_{n+1}$ stands for $S(X_{n+1}) =
E(Y_{n+1} | X_{n+1})$. The first corollary deals with asymptotic confidence sets for general functionals
of the theoretical predictor such as weighted integrals.

Corollary 14 Let m be a fixed function in the space $H = L^2([0, 1])$. We have the following
asymptotic confidence interval for $\int Y^*_{n+1}(t) m(t)\, dt$ at level $1 - \alpha$:

$$P\left( \int Y^*_{n+1}(t) m(t)\, dt \in \left[ \int \hat Y_{n+1}(t) m(t)\, dt \pm \sqrt{\frac{k}{n}}\, \sigma_m q_{1-\alpha/2} \right] \right) \to 1 - \alpha,$$

where $\sigma_m^2 = \langle m, \Gamma_\varepsilon m \rangle = \int\int \Gamma_\varepsilon(s,t) m(t) m(s)\, dt\, ds$ rewritten in 'kernel' form and $q_{1-\alpha/2}$ is the
quantile of order $1 - \alpha/2$ of the $\mathcal{N}(0, 1)$ distribution.
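
The interval of Corollary 14 is straightforward to compute once a predictor and the noise covariance kernel are available. The sketch below assumes a uniform grid on [0, 1] and, importantly, treats the kernel $\Gamma_\varepsilon(s,t)$ as known, whereas in practice it has to be estimated. The function name `weighted_integral_ci` and the numerical values are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def weighted_integral_ci(Y_hat, m, noise_cov, n, k, alpha=0.05):
    """Interval  int Y_hat m  +/-  sqrt(k/n) * sigma_m * q_{1-alpha/2}.

    Y_hat, m: (p,) arrays on a uniform grid of [0, 1];
    noise_cov: (p, p) array holding the kernel Gamma_eps(s, t), assumed known here.
    """
    p = m.shape[0]
    center = Y_hat @ m / p                       # Riemann sum for int Y_hat(t) m(t) dt
    sigma_m = np.sqrt(m @ noise_cov @ m) / p     # sqrt of the double integral <m, Gamma_eps m>
    q = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # standard normal quantile
    half = np.sqrt(k / n) * sigma_m * q
    return center - half, center + half

# Hypothetical inputs: constant predictor, constant weight, constant noise kernel.
p = 100
m = np.ones(p)
Y_hat = np.ones(p)
noise_cov = np.full((p, p), 4.0)
lower, upper = weighted_integral_ci(Y_hat, m, noise_cov, n=400, k=4)
```

With these inputs the center is 1, $\sigma_m = 2$ and $\sqrt{k/n} = 0.1$, so the half-width is 0.2 times the normal quantile of order 0.975.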

Theorem 12 holds for the Hilbert norm. In order to derive a confidence interval for $Y^*_{n+1}(t_0)$
(where $t_0$ is fixed in [0, 1]), we have to make sure that the evaluation (linear) functional $f \in
H \mapsto f(t_0)$ is continuous for the norm $\|\cdot\|$. This functional is always continuous in the space
$(C([0, 1]), |\cdot|_\infty)$ but is not in the space $L^2([0, 1])$. A slight change in H will yield the desired
result, stated in the next Corollary.
Corollary 15 When $H = W_0^{2,1}([0,1]) = \{ f \in L^2([0, 1]) : f(0) = 0,\ f' \in L^2([0, 1]) \}$ endowed
with the inner product $\langle u, v \rangle = \int_0^1 u'v'$, the evaluation functional is continuous with respect to
the norm of H and we can derive from Theorem 12:

$$P\left( Y^*_{n+1}(t_0) \in \left[ \hat Y_{n+1}(t_0) \pm \sqrt{\frac{k}{n}}\, \sigma_{t_0} q_{1-\alpha/2} \right] \right) \to 1 - \alpha,$$

where $\sigma_{t_0}^2 = \Gamma_\varepsilon(t_0, t_0)$.

Note that data $(Y_i)_{1 \le i \le n}$ reconstructed by cubic splines and correctly rescaled to match the
condition [f(0) = 0] belong to the space $W_0^{2,1}([0, 1])$ mentioned in the Corollary.



Remark 16 It is out of the scope of this article to go through all the testing issues which can
be solved by Theorem 12. It is interesting to note that if S = 0, the Theorem ensures that

$$\sqrt{\frac{n}{k}}\, S_n(X_{n+1}) \xrightarrow{w} G_\varepsilon,$$

which may be the starting point for a testing procedure of S = 0 versus various alternatives.



2.3 Comparison with existing results - Conclusion 

The literature on linear models for functional data has given rise to impressive recent
works. We briefly discuss here our contribution with respect to some articles close in spirit to
the present paper.

We consider exactly the same model (with functional outputs) as Yao, Müller and Wang
(2005) and our estimate is particularly close to the one they propose. In their work the case of
longitudinal data was studied with care, with possibly sparse and irregular data. They introduce
a very interesting functional version of the $R^2$ and prove convergence in probability of their
estimates in Hilbert-Schmidt norm. We complete their work by providing the rates and optimality for
convergence in mean square.

Our initial philosophy is close to the article by Crambes, Kneip and Sarda (2009). Like
these authors we consider prediction with random design. We think that this approach is
the most justified from a statistical point of view. The case of a fixed design gives rise
to several situations and different rates (with possible oversmoothing which entails parametric
rates of convergence which are odd in this truly nonparametric model) and does not necessarily
correspond to the statistical reality. The main differences lie in the fact that our results hold
in mean square norm rather than in probability, for a larger class of data and parameters, at the
expense of more restrictive moment assumptions.

Our methodology is closer to the article by Hall and Horowitz (2007). They studied the
prediction risk at a fixed design in the model with real outputs (2) but with specified eigenvalues,
namely $\lambda_j \sim Cj^{-1-\alpha}$, and parameter spectral decomposition $\langle \beta, e_j \rangle \sim Cj^{-1-\gamma}$ with $\alpha, \gamma > 0$.
The comparisons may be simpler with these works since we share the approach through spectral
decomposition of operators or Karhunen-Loève expansion for the design X.

The problem of weak convergence is considered only in Yao, Müller and Wang (2005) : they provide very useful and practical pointwise confidence sets which imply estimation of the covariance of the noise. Our result may allow to consider a larger class of testing issues through delta-methods (we have in mind testing of hypotheses like $S = S_0$ versus $S = S_0 + \eta_n v$ where $\eta_n \to 0$ and v belongs to a well-chosen set in H).

The contribution of this article essentially deals with a linear regression model: the concerns related to the functional outputs concentrate on lower bounds in optimality results and in proving weak convergence with specific techniques adapted to functional data. We hope that our methods will demonstrate that optimal results are possible in a general framework and that regularity assumptions can often be relaxed thanks to the compensation (or regularity/inverse problem trade-off) phenomenon mentioned within Remark 9. The Hilbert space framework is necessary at least in the section devoted to weak convergence. Generalizations to Banach spaces of functions could be investigated, for instance in C([0, 1]), Hölder or Besov spaces.

Finally we do not investigate in this paper the practical point of view of this prediction method. It is a work in progress. Many directions can be considered. The practical choice of $k_n$ is crucial. Since we provide the exact theoretical formula for the optimal projection dimension at (10), it would be interesting to compare it with the results of a cross-validation method on a simulated dataset. The covariance structure of the noise is a central and major concern : the covariance operator appears in the limiting distribution, and its trace determines the optimal choice of the dimension $k_n^*$. Estimating $\Gamma_\varepsilon$ turns out to be challenging both from a practical and applied point of view.
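
As an illustration of the comparison suggested above, here is a minimal sketch of the projection estimate together with a K-fold cross-validation choice of the dimension k on simulated data. It is not the authors' implementation: the curves are assumed to be already expressed in orthonormal coordinates (so that inner products are Euclidean) and centred, and the function names `fit_fpca_regression` and `cv_choose_k`, the eigenvalues $j^{-2}$, the diagonal operator S and the noise level are all hypothetical choices made for the example.

```python
import numpy as np

def fit_fpca_regression(X, Y, k):
    """Return the (q, p) matrix of the estimate Delta_n Gamma_n^dagger.

    X is (n, p), Y is (n, q); both are assumed centred and expressed in
    orthonormal coordinates so that inner products are Euclidean.
    """
    n = X.shape[0]
    gamma_n = X.T @ X / n                       # empirical covariance of the inputs
    delta_n = Y.T @ X / n                       # empirical cross-covariance
    eigval, eigvec = np.linalg.eigh(gamma_n)    # eigenvalues in ascending order
    top = np.argsort(eigval)[::-1][:k]          # indices of the k largest
    gamma_dagger = (eigvec[:, top] / eigval[top]) @ eigvec[:, top].T
    return delta_n @ gamma_dagger

def cv_choose_k(X, Y, k_grid, n_folds=5, seed=0):
    """Pick k in k_grid minimising the K-fold squared prediction error."""
    n = X.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
    scores = []
    for k in k_grid:
        err = 0.0
        for held_out in folds:
            train = np.setdiff1d(np.arange(n), held_out)
            s_hat = fit_fpca_regression(X[train], Y[train], k)
            err += np.sum((Y[held_out] - X[held_out] @ s_hat.T) ** 2)
        scores.append(err / n)
    return k_grid[int(np.argmin(scores))], scores

# Hypothetical simulated data: p = q = 15, lambda_j = j^{-2}, diagonal S.
rng = np.random.default_rng(0)
n, p = 400, 15
lam = np.arange(1, p + 1) ** -2.0
X = rng.standard_normal((n, p)) * np.sqrt(lam)
S = np.diag(1.0 / np.arange(1, p + 1))
Y = X @ S.T + 0.2 * rng.standard_normal((n, p))
k_cv, cv_scores = cv_choose_k(X, Y, list(range(1, p + 1)))
```

The cross-validated `k_cv` could then be set against the theoretical optimal dimension; how close the two are will depend on the sample size and on the simulated design.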



3 Mathematical derivations 



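In the sequel, the generic notation C stands for a constant which does not depend on k, n or S. All our results are related to the decomposition given below :

\[ \widehat{S}_n = S\, \Gamma_n \Gamma_n^{\dagger} + U_n \Gamma_n^{\dagger} = S\, \widehat{\Pi}_k + \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i \otimes \Gamma_n^{\dagger} X_i . \qquad (11) \]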
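
The decomposition (11) is an exact algebraic identity, which can be checked numerically in finite dimension. The sketch below assumes Euclidean coordinates, so that $u \otimes v$ is the matrix $u v^{T}$; the dimensions, the operator `S` and the noise level are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q, k = 200, 8, 5, 3
X = rng.standard_normal((n, p)) * np.arange(1, p + 1) ** -1.0   # variances j^{-2}
S = rng.standard_normal((q, p))                                  # hypothetical operator
eps = 0.1 * rng.standard_normal((n, q))
Y = X @ S.T + eps

gamma_n = X.T @ X / n                      # empirical covariance Gamma_n
delta_n = Y.T @ X / n                      # empirical cross-covariance Delta_n
w, v = np.linalg.eigh(gamma_n)
top = np.argsort(w)[::-1][:k]
e_hat, lam_hat = v[:, top], w[top]
gamma_dagger = (e_hat / lam_hat) @ e_hat.T   # Gamma_n^dagger
pi_hat = e_hat @ e_hat.T                     # projector on the first k eigenvectors

s_hat = delta_n @ gamma_dagger               # the estimate
noise_part = eps.T @ X @ gamma_dagger / n    # (1/n) sum_i eps_i (x) Gamma_n^dagger X_i
rhs = S @ pi_hat + noise_part                # right hand side of (11)
```

Up to rounding, `s_hat` and `rhs` coincide, whatever the draw of the data.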



It is plain that a bias-variance decomposition is exhibited just above. The random projection $\widehat{\Pi}_k$ is not a satisfactory term and we intend to remove it and to replace it with its non-random counterpart. When turning to the predictor, (11) may be enhanced :

\[ \widehat{S}_n (X_{n+1}) - S (X_{n+1}) = S (\Pi_k - I) (X_{n+1}) + S [\widehat{\Pi}_k - \Pi_k] (X_{n+1}) + \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i \langle \Gamma_n^{\dagger} X_i , X_{n+1} \rangle , \qquad (12) \]

where $\Pi_k$ is defined in the same way as we defined $\widehat{\Pi}_k$ previously, i.e. the projection on the first k eigenvectors of $\Gamma$.

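In terms of mean square error, the following easily stems from $E (\varepsilon \mid X) = 0$ :

\[ E \| \widehat{S}_n (X_{n+1}) - S (X_{n+1}) \|^2 = E \| S \widehat{\Pi}_k (X_{n+1}) - S (X_{n+1}) \|^2 + E \Big\| \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i \langle \Gamma_n^{\dagger} X_i , X_{n+1} \rangle \Big\|^2 . \]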



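We prove below that :

\[ E \| S [\widehat{\Pi}_k - \Pi_k] (X_{n+1}) \|^2 = o \Big( E \Big\| \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i \langle \Gamma_n^{\dagger} X_i , X_{n+1} \rangle \Big\|^2 \Big) , \qquad (13) \]

and that the two terms that actually influence the mean square error are the first and the third in display (12). The first term $S (\Pi_k - I) (X_{n+1})$ is the bias term and the third a variance term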
(see display ©). 

The proofs are split into two parts. In the first part we provide some technical lemmas which are collected there to enhance the reading of the second part, devoted to the proof of the main results. In all the sequel, the sequence $k = k_n$ depends on n even if this index is dropped. We assume that all the assumptions mentioned earlier in the paper hold ; they will however be recalled when addressing crucial steps. We assume once and for all that $(k \log k)^2 / n \to 0$ as announced in Remark 1 above. The rate of convergence to 0 of $(k \log k)^2 / n$ will be tuned when dealing with weak convergence.

3.1 Preliminary material



All along the proofs, we will make an intensive use of perturbation theory for bounded operators. 
It may be useful to have basic notions about spectral representation of bounded operators and 
perturbation theory. We refer to Kato (1976), Dunford and Schwartz (1988, Chapter VII. 3) 
or to Gohberg, Goldberg and Kaashoek (1991) for an introduction to functional calculus for 
operators related with Riesz integrals. Roughly speaking, several results mentioned below and 
throughout the article may be easily understood by considering the formula of residues for 
analytic functions on the complex plane (see Rudin (1987)) and extending it to functions still 
defined on the complex plane but with values in the space of operators. The introduction of 
Gohberg, Goldberg and Kaashoek (1991, pp. 4-16) is illuminating with respect to this issue. 

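Let us denote by $\mathcal{B}_j$ the oriented circle of the complex plane with center $\lambda_j$ and radius $\delta_j / 2$ where $\delta_j = \min \{ \lambda_j - \lambda_{j+1}, \lambda_{j-1} - \lambda_j \} = \lambda_j - \lambda_{j+1}$, the last equality coming from the convexity associated to the $\lambda_j$'s. Let us define $\mathcal{C}_k = \bigcup_{j=1}^{k} \mathcal{B}_j$. The open domain whose boundary is $\mathcal{C}_k$ is not connected but we can apply the functional calculus for bounded operators (see Dunford-Schwartz, Section VII.3, Definitions 8 and 9). With this formalism at hand it is easy to prove the following formulas :

\[ \Pi_k = \frac{1}{2 \pi i} \int_{\mathcal{C}_k} (zI - \Gamma)^{-1} dz , \qquad \Gamma^{\dagger} = \frac{1}{2 \pi i} \int_{\mathcal{C}_k} \frac{1}{z} (zI - \Gamma)^{-1} dz . \]

The same is true with the random $\Gamma_n$, but the contour $\mathcal{C}_k$ must be replaced by its random counterpart $\widehat{\mathcal{C}}_k = \bigcup_{j=1}^{k} \widehat{\mathcal{B}}_j$ where each $\widehat{\mathcal{B}}_j$ is a random ball of the complex plane with center $\widehat{\lambda}_j$ and for instance a radius $\widehat{\delta}_j / 2$ with plain notations. Then

\[ \widehat{\Pi}_k = \frac{1}{2 \pi i} \int_{\widehat{\mathcal{C}}_k} (zI - \Gamma_n)^{-1} dz , \qquad \Gamma_n^{\dagger} = \frac{1}{2 \pi i} \int_{\widehat{\mathcal{C}}_k} \frac{1}{z} (zI - \Gamma_n)^{-1} dz . \]

This first lemma is based on convex inequalities. In the sequel, much depends on the bounds derived in this Lemma.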
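
The contour representation of the projector and of the truncated inverse can be checked numerically on a symmetric matrix. The sketch below is only an illustration under assumed settings: a 6 x 6 covariance with hypothetical eigenvalues $j^{-2}$, circles of radius $\delta_j / 2$ around the first k eigenvalues, and a plain trapezoidal rule (the helper name `contour_integral` is ours).

```python
import numpy as np

def contour_integral(gamma, center, radius, weight=lambda z: 1.0, n_nodes=256):
    """Trapezoidal approximation of (1 / (2 pi i)) * integral over the circle
    |z - center| = radius of weight(z) * (zI - gamma)^{-1} dz."""
    d = gamma.shape[0]
    total = np.zeros((d, d), dtype=complex)
    for t in 2 * np.pi * np.arange(n_nodes) / n_nodes:
        z = center + radius * np.exp(1j * t)
        dz = 1j * radius * np.exp(1j * t) * (2 * np.pi / n_nodes)
        total += weight(z) * np.linalg.inv(z * np.eye(d) - gamma) * dz
    return (total / (2j * np.pi)).real

lam = np.arange(1, 7) ** -2.0                       # hypothetical eigenvalues
rng = np.random.default_rng(0)
q, _ = np.linalg.qr(rng.standard_normal((6, 6)))    # orthonormal eigenvectors
gamma = (q * lam) @ q.T
k = 3
delta = lam[:-1] - lam[1:]                          # delta_j = lambda_j - lambda_{j+1}

pi_k = sum(contour_integral(gamma, lam[j], delta[j] / 2) for j in range(k))
gamma_dagger = sum(
    contour_integral(gamma, lam[j], delta[j] / 2, weight=lambda z: 1 / z)
    for j in range(k)
)
pi_exact = q[:, :k] @ q[:, :k].T
dagger_exact = (q[:, :k] / lam[:k]) @ q[:, :k].T
```

Both integrals agree with the spectral formulas to high accuracy, because each circle encloses exactly one eigenvalue and leaves the origin outside.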



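Lemma 17 Consider two large enough positive integers j and k such that k > j. Then

\[ j \lambda_j \geq k \lambda_k , \qquad \lambda_j - \lambda_k \geq \Big( 1 - \frac{j}{k} \Big) \lambda_j , \qquad \sum_{l \geq k} \lambda_l \leq (k + 1) \lambda_k , \]

\[ \sum_{j \geq 1,\, j \neq k} \frac{\lambda_j}{| \lambda_k - \lambda_j |} \leq C k \log k . \]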



Besides

\[ E \sup_{z \in \mathcal{B}_j} \big\| (zI - \Gamma)^{-1/2} (\Gamma - \Gamma_n) (zI - \Gamma)^{-1/2} \big\|_{\mathcal{L}_2}^2 \leq \frac{C}{n} (j \log j)^2 . \qquad (16) \]



The proof of this lemma will be found in Cardot, Mas, Sarda (2007), pp. 339-342. 
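
For a concrete feel of these convexity bounds, the following sketch checks them on the hypothetical sequence $\lambda_j = j^{-2}$, truncated at a large index J (an assumption of the example: the truncation only makes the tail sums slightly smaller). The helper names `check_pair` and `ratio_sum` are ours.

```python
import numpy as np

J = 200_000
lam = np.arange(1, J + 1) ** -2.0            # lambda_j = j^{-2}, j = 1..J

def check_pair(j, k):
    """Return the truth values of the three inequalities for a pair j < k."""
    lj, lk = lam[j - 1], lam[k - 1]
    tail = lam[k - 1:].sum()                 # truncated sum over l >= k
    return (j * lj >= k * lk,
            lj - lk >= (1 - j / k) * lj,
            tail <= (k + 1) * lk)

def ratio_sum(k):
    """Truncated sum over j != k of lambda_j / |lambda_k - lambda_j|."""
    others = np.delete(lam, k - 1)
    return np.sum(others / np.abs(lam[k - 1] - others))
```

For this particular sequence, `ratio_sum(k) / (k * np.log(k))` stays bounded as k grows, in line with the last inequality.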
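We introduce the following event :

\[ \mathcal{A}_n = \Big\{ \forall j \in \{ 1, ..., k_n \} , \ \frac{| \widehat{\lambda}_j - \lambda_j |}{\delta_j} < 1/2 \Big\} , \]

which describes the way the estimated eigenvalues concentrate around the population ones : the higher the index j the closer are the $\widehat{\lambda}_j$'s to the $\lambda_j$'s.

Proposition 18 If $(k \log k)^2 / n \to 0$,

\[ P ( \limsup \overline{\mathcal{A}}_n ) = 0 . \]

Proof : We just check that the Borel-Cantelli lemma holds, namely $\sum_{n=1}^{+\infty} P ( \overline{\mathcal{A}}_n ) < +\infty$, where

\[ P ( \overline{\mathcal{A}}_n ) = P \Big( \exists j \in \{ 1, ..., k_n \} : \frac{| \widehat{\lambda}_j - \lambda_j |}{\delta_j} > 1/2 \Big) \leq \sum_{j=1}^{k} P \Big( \frac{| \widehat{\lambda}_j - \lambda_j |}{\lambda_j} > \frac{\delta_j}{2 \lambda_j} \Big) \leq \sum_{j=1}^{k} P \Big( \frac{| \widehat{\lambda}_j - \lambda_j |}{\lambda_j} > \frac{1}{2 (j + 1)} \Big) . \]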

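Now, applying the asymptotic results proved in Bosq (2000) at pages 122-124, we see that the asymptotic behaviour of $P \big( | \widehat{\lambda}_j - \lambda_j | / \lambda_j > 1 / (2 (j + 1)) \big)$ is the same as

\[ P \Big( \Big| \frac{1}{n} \sum_{i=1}^{n} \langle X_i , e_j \rangle^2 - \lambda_j \Big| > \frac{\lambda_j}{2 (j + 1)} \Big) . \]

We apply Bernstein's exponential inequality, which is possible due to assumption (6), to the latter, and we obtain (for the sake of brevity j + 1 was replaced by j in the right side of the probability but this does not change the final result) :

\[ P \Big( \Big| \frac{1}{n} \sum_{i=1}^{n} \langle X_i , e_j \rangle^2 - \lambda_j \Big| > \frac{\lambda_j}{2 j} \Big) \leq 2 \exp \Big( - \frac{n}{j^2} \, \frac{1}{8 c + 1 / (6 j)} \Big) \leq 2 \exp \Big( - C \frac{n}{j^2} \Big) , \]

and then

\[ \sum_{j=1}^{k} P \Big( \frac{| \widehat{\lambda}_j - \lambda_j |}{\lambda_j} > \frac{1}{2 j} \Big) \leq 2 k \exp \Big( - C \frac{n}{k^2} \Big) . \]

Now it is plain from $(k \log k)^2 / n \to 0$ that $k \exp ( - C n / k^2 ) \leq 1 / n^{1 + \varepsilon}$ for some $\varepsilon > 0$, which leads to checking that $\sum_n k_n \exp ( - C n / k_n^2 ) < +\infty$, and to the statement of Proposition 18 through Borel-Cantelli's Lemma.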
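
A small simulation may help to visualise the event that every one of the first k empirical eigenvalues stays within $\delta_j / 2$ of its population counterpart. The sketch assumes Gaussian scores, hypothetical eigenvalues $\lambda_j = j^{-2}$ and a truncation to d coordinates; `event_holds` is a name chosen for the example.

```python
import numpy as np

def event_holds(n, k, d=20, seed=0):
    """Check |lambda_hat_j - lambda_j| / delta_j < 1/2 for j = 1..k on one sample."""
    rng = np.random.default_rng(seed)
    lam = np.arange(1, d + 1) ** -2.0
    X = rng.standard_normal((n, d)) * np.sqrt(lam)      # scores with Var = lambda_j
    lam_hat = np.sort(np.linalg.eigvalsh(X.T @ X / n))[::-1]
    delta = lam[:k] - lam[1:k + 1]                      # delta_j = lambda_j - lambda_{j+1}
    return bool(np.all(np.abs(lam_hat[:k] - lam[:k]) / delta < 0.5))
```

With a small dimension relative to the sample size, for instance `event_holds(5000, 3)`, the event holds comfortably; letting k grow at fixed n makes it fail, which is the trade-off encoded in the condition on $(k \log k)^2 / n$.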

Corollary 19 We may write

\[ \widehat{\Pi}_{k_n} = \frac{1}{2 \pi i} \int_{\mathcal{C}_k} (zI - \Gamma_n)^{-1} dz , \qquad \Gamma_n^{\dagger} = \frac{1}{2 \pi i} \int_{\mathcal{C}_k} \frac{1}{z} (zI - \Gamma_n)^{-1} dz \quad a.s., \]

where this time the contour is $\mathcal{C}_k$, hence no more random.

Proof : From Proposition 18 it is plain that we may assume that almost surely $\widehat{\lambda}_j$ lies inside $\mathcal{B}_j$ for $j \in \{ 1, ..., k \}$. Then the formulas above easily stem from perturbation theory (see Kato (1976), Dunford and Schwartz (1988) for instance).

3.2 Proofs of the main results 

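We start with proving (13) as announced in the foreword of this section. What we give here is nothing but the term $A_n$ in Theorem 3.

Proposition 20 The following bound holds :

\[ E \| S ( \widehat{\Pi}_k - \Pi_k ) (X_{n+1}) \|^2 \leq C \frac{k^2 \lambda_k}{n} \| S \|_{\mathcal{L}_2} . \]

Proof : We start with noting that

\[ E \| S ( \widehat{\Pi}_k - \Pi_k ) (X_{n+1}) \|^2 = E \, \mathrm{tr} \big( \Gamma ( \widehat{\Pi}_k - \Pi_k ) S^* S ( \widehat{\Pi}_k - \Pi_k ) \big) = E \big\| S ( \widehat{\Pi}_k - \Pi_k ) \Gamma^{1/2} \big\|_{\mathcal{L}_2}^2 = \sum_{j=1}^{+\infty} \sum_{\ell=1}^{+\infty} E \big\langle S ( \widehat{\Pi}_k - \Pi_k ) \Gamma^{1/2} (e_j) , e_\ell \big\rangle^2 . \]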



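By Corollary 19 we have

\[ \widehat{\Pi}_k - \Pi_k = \frac{1}{2 \pi i} \sum_{m=1}^{k} \int_{\mathcal{B}_m} \big\{ (zI - \Gamma_n)^{-1} - (zI - \Gamma)^{-1} \big\} dz = \sum_{m=1}^{k} T_{m,n} , \qquad (17) \]

where $T_{m,n} = \frac{1}{2 \pi i} \int_{\mathcal{B}_m} (zI - \Gamma_n)^{-1} (\Gamma - \Gamma_n) (zI - \Gamma)^{-1} dz$.

To go ahead now, we ask the reader to accept momentarily that for all $m \leq k$, the asymptotic behaviour of $T_{m,n}$ is the same as

\[ T^*_{m,n} = \frac{1}{2 \pi i} \int_{\mathcal{B}_m} (zI - \Gamma)^{-1} (\Gamma - \Gamma_n) (zI - \Gamma)^{-1} dz , \]

where the random $(zI - \Gamma_n)^{-1}$ was replaced by the non-random $(zI - \Gamma)^{-1}$, and that studying $\widehat{\Pi}_k - \Pi_k$ comes down to studying

\[ \sum_{m=1}^{k} T^*_{m,n} . \]

The proof that this switch is allowed is postponed to Lemma 21. We go on with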

k 

2 77/ 



s' :Ib, II/, 



2tu 



^ m =l 

E / ((^-rr 1 (r-r n )(e,),^) 

rn=l ^ Bm 



z — A,- 



where S* is the adjoint operator of S. We obtain 

/ ((zi-rr^r-iVKe,-),^ 



z — A,- 



dz 

Z — Xn 



f + °° dz 



We deduce that 

(s(u k -u k ) r 1 / 2 ( ej ),e e 



2tu 



+oo A; „ 

E (( r ~ r «) fe) ' e J v ) ( 5 * e ^ e J v ) E / 

j'=l m=l Br 



riz 



(z - A,-) (z - Aj/) 
then



+ 



where 



s(fi fc -II fc ) T 1 ' 2 {ej),e t ) 

fx~ ^ ^ 

- ^7 E <( r - r «) ( s *^ e r) E / 

j'=l m=l ^' 

— +00 fc „ 

f ((T-T n )(e j ),e j/ )(S*e e ,e j! )J2 / 

j'=k+l m=l JB 



2iu 



dz 

(z - Xj) (z - A,-/) 
(z - Aj) (z - Aj/) ' 



if 



— J J J - J 

= I (Aj - A^) - if / > m,j < m, 

(z - Xj) (z - X f ) I (X f - Xj)' 1 if f <m,j > m, 
J 1 - 1 = if j,f < m. 



Then 



+00 

E( 5 ( fi *-n 

3=1 



+00 

+ E 

j=k+l 



fx, • 



; )r'/' fe ), e( ) 2 = X: 

* ((r-r ra ) (ejhe,, 
2^ (\,,-\A 



rc~ +00 



((r-r») (ej),ej/) 

2*/ f- ( (Aj/ - A,) V J 7 



n 2 



V- U 1 - i nKejJ,e J -/; . , 



= A + 5, 



where 



A = 



£ = 



1 
1 

4^2 



k 




+00 


E^- 




E 




/ 


=fc+i 


+00 






E * 




E 


j=fc+i 




j'=i 



1 2 



We first compute EA To that aim we focus on 

Jg ((r-r„)( e ,) e,-,> (g , e6ej| T = g E<(T -nfa)..,)' >a 
+ £ e ( ( r n - r) (e 3 -) , e ,-,) < ( r - r B ) (e 3 -) , e 3 „) f* e ^ f* ei :7'l 

j',j"=k+i \ i i'> ^ i i"> 



j +00 
n 



E A i A .j' /o* \ 2 

1 r, , 9 . . . ,i (S*eg, ej/) (S*et, e,») 

+ " E E (X, ej -) 2 X,e 3 v X, ej - W ) J { 

n j-'j^+i L J ^ A J ~ VJ l A J ~ VO 
Then



E 



j'=k+l \ J ' 



- Cl ~ 2^ 7T m W e W/ +C2 TT 2^ v Wtt TTTx TT 

n j'tt+i (*i - Aj') n r J^ k+1 (A, - A^j (Aj - \ f ,) 



(Aj - Aj') (Aj - Aj«) 



We could prove exactly in the 
E 

We turn back to 



A ((r w - T) (ej) ,ej/) 
(A/ - A,") 



same way that 

2 



(S'*e£, e,-/ 



n 



+oo 



+oo /X - 

E rA _1 y (^ e i') 
j'=k+l \ A 3 A 3') j'=k+l 

2k nr— +oo 

= E 7r^rTl^ e ^l + E 

j'=fc+l ^ j ^ j'=2fc- 



j'=2fc+l ( A J " A J') 
/T 2fc _ +oo 

TX^T E \(S*e,e r )\ + l Y, 
(A, - A fc+ i) Xj .,^ 4 



A,-/ \ \S*t 



hence 



e^4 < -Fa] 



2fc 



— ^Ssf E 

(Aj - A fc+ i) \ jl=k+1 



+ 



E x 

\j'=2k+l 



The term below is bounded by : 



ni I +oo +oo 

f E v E i<» 



2 \ +2$; 

j'=2k+l j>=2k+l I " i'=2fc4- 



(18) 



(19) 



\j>=2k+l j'=2k+l J " j'=2k+l 

because Yly=2k+i A j' — i^k + 1) A2/C+1 < fcAfc by Lemma [TTJ We focus on the term on line 

k F . / 2k \ 2 1 k [ / X 2 / 2fc 

E A ? „ r ,2 1 E K*WI| < A * + iE (t^ 1 -) ( E 



HD: 



(Aj - A fc+ i) 2 L, =fc+ , 



< Afc+i 



2fc 



<* E K^.ej') 
Vi'=fc+i 



k / 2k \ 

+ i) 2 A fc+1 E-i <c E K^ e />l 2 

5=1 7 V'=fc+1 / 



l2 



^ 2 Afc+i, 
hence $E A \leq \frac{C}{n} \Big( \sum_{j'=k+1}^{+\infty} \langle S^* e_\ell , e_{j'} \rangle^2 \Big) k^2 \lambda_k$. We turn to proving a similar bound for B. The method is given because it is significantly distinct. We start from (18) and we denote by $\lfloor x \rfloor$ the largest integer smaller than x :



n i ^ 



A,- 



,v = 1 ("V - A i) 



e,/ 



/l*/2j A— 

E 



n 



\j>=i (V - A i) 
/Lfc/2J j 



(5*e/ jei /)|] +[ Yl ( A _\.) |(g*e<,e,-/)| 



Lfc/2j 



U'=Lfc/2j 



i i 



fc E ( s * e ^ e 3 



Afr — A, 7 — 



i'=i 



3 J j'=[k/2\ 



j'=l 



n \j — k 



E 

j'=Vk/2\ 



From the definition of B, we get finally



+oo 



+oo 



e^<^ E|{^, ej ,>| 2 e^ + E (s-W E^. 



j=fc+i 



u'=Lfc/2j 



j=k+l 



j - k 



It is plain that, for sufficiently large k, Ylj'=\k/2\ (S* e ii e j') 2 — (otherwise (S*e£,ej'Y 
cannot converge), whence 



< 



E ( 5 * e *> e . 

i'=Lfc/2j 

2k 

E 



+oo 



n 



J 



j - k 



2 +oo 

+ 4^A, 



j — k 



< 



C 



n 



2k 

E ^ 



j - k 



2 +00 



j=2k 



j -k 



j=k+l w ' j=2k 

Denoting $\chi_k = \sup_{k+1 \leq j \leq 2k} ( j \log j \, \lambda_j )$ we get at last :

2 



2k 

E >* 

j=k+i 



j 



j 2fc 

^^sup (j'bgjAj)— 1 



j — k ) k+l<j<2k 



log k . ^ri i - fe 



1 



< x^. — - — — — < c/cxfc, 

log k t— 1 j 

° 7 = 1 



and E£ < C^x k (j2f=i \ { S * e ^ e j')\ 2 ) > with x fc ~> °- Finally : 



+oo +oo 

E£(*(n*-n, 

J=l <?=1 



+oo +oo 



ej),eej < C-x fe EE K^*' e J 



This last bound almost concludes the rather long proof of Proposition 20. It remains to ensure that switching $T^*_{m,n}$ and $T_{m,n}$ as announced just below display (17) is possible.

Lemma 21 We have
+00 +00 



j=i 1=1 



+00 +00 k 



ej ),e e ) ^EY,Y,Y,\ S V 1/2 (e j ),e e 

j=l £=1 m=l 



In other words, switching n and T mj „ is possible in display 

The proof of this Lemma is close to the control of the second order term at pages 351-352 of Cardot, Mas and Sarda (2007) and we will give a sketch of it. We start from :

\[ T_{m,n} = \frac{1}{2 \pi i} \int_{\mathcal{B}_m} (zI - \Gamma_n)^{-1} (\Gamma - \Gamma_n) (zI - \Gamma)^{-1} dz = \frac{1}{2 \pi i} \int_{\mathcal{B}_m} (zI - \Gamma)^{-1/2} R_n (z) (zI - \Gamma)^{-1/2} (\Gamma - \Gamma_n) (zI - \Gamma)^{-1} dz , \]

with $R_n (z) = (zI - \Gamma)^{1/2} (zI - \Gamma_n)^{-1} (zI - \Gamma)^{1/2}$. Besides, as can be seen from Lemma 4 in Cardot, Mas and Sarda (2007),

\[ \big[ I + (zI - \Gamma)^{-1/2} (\Gamma - \Gamma_n) (zI - \Gamma)^{-1/2} \big] R_n (z) = I . \]

Denoting $S_n (z) = (zI - \Gamma)^{-1/2} (\Gamma - \Gamma_n) (zI - \Gamma)^{-1/2}$, it is plain that when $\| S_n (z) \| < 1$ for all $z \in \mathcal{C}_k$,

\[ R_n (z) = I + \sum_{m=1}^{+\infty} (-1)^m S_n^m (z) := I + R_n^0 (z) , \]

with $\| R_n^0 (z) \|_\infty \leq C \| S_n (z) \|_\infty$ for all $z \in \mathcal{C}_k$. Turning back to our initial equation we get, conditionally to $\| S_n (z) \| < 1$ for all $z \in \mathcal{C}_k$ :

\[ T_{m,n} - T^*_{m,n} = \frac{1}{2 \pi i} \int_{\mathcal{B}_m} (zI - \Gamma)^{-1/2} R_n^0 (z) (zI - \Gamma)^{-1/2} (\Gamma - \Gamma_n) (zI - \Gamma)^{-1} dz , \]

and we confine ourselves to considering only the first term in the development of $R_n^0 (z)$, which writes

\[ - (2 \pi i)^{-1} \int_{\mathcal{B}_m} (zI - \Gamma)^{-1/2} S_n^2 (z) (zI - \Gamma)^{-1/2} dz . \]
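
The Neumann expansion used in this sketch of proof can be illustrated on matrices. The block below is a toy check, not part of the argument: `S_n` is a hypothetical 5 x 5 matrix rescaled to have operator norm 0.4, `R` solves $(I + S_n) R = I$, and the remainder `R0` is compared with the geometric bound $\| S_n \| / (1 - \| S_n \|)$.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 5))
S_n = 0.4 * A / np.linalg.norm(A, 2)        # operator norm equal to 0.4 < 1
R = np.linalg.inv(np.eye(5) + S_n)          # solves (I + S_n) R = I

partial = np.eye(5)                         # running value of I + sum_m (-1)^m S_n^m
term = np.eye(5)
for m in range(1, 60):
    term = -term @ S_n                      # now equal to (-1)^m S_n^m
    partial += term

R0 = R - np.eye(5)                          # remainder R_n^0
s = np.linalg.norm(S_n, 2)
bound = s / (1 - s)                         # geometric-series bound on ||R0||
```

Here the constant C of the text can be taken as $1 / (1 - \| S_n \|)$, which stays bounded as long as the norm is kept away from 1.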

Now split S (u k ~ Ilk) I 1/2 = s(u k - n k ) I 1/2 llj + S fn fc - n fe ) T 1 / 2 !^ where 



J = < sup 

zdC k 



{zl-Vy 1 / 2 (T-T n )(zl-T) 



-1/2 



C-2 



and r n will be tuned later. We have : 



and 



< 



E 



s(n fe -iOr 1/2 ii 



< 4 



£2 



ST l/2 



C-2 



s 



s 



n fc -n fc )-^r,; ir 



m=l 



r l/2 H 



J 



C-2 



E ( 27rt ) _1 / ( z/ - r )~ 1/2 s " (*) ( zI - r r 1/2 dz 



r l/2 n 



£2 



< (2 ^)-i *Vi ra sup { (zi - vy 1 ' 2 v 1 ' 2 s (zi - ry 1 / 2 } 



T 2r,2 * 



< (2-r 1 ii5iL^E^ 



,m. 



m=l 



(20) 
Now from 5Zm=l m ^m < +00 we get 5 m m < cj \Jm log to hence Em=i A



m 



o y\/k n /nj whenever k^T^/n 3 — > 0. 

The last step consists in controlling the right hand side of (20). In Cardot, Mas and Sarda
(2007) this is done by classical Markov moment assumptions under the condition that k\ log 4 n jn 
tends to zero. Here, Bernstein's exponential inequality yields a tighter bound and ensures that 
P (j\ = o (k n /n) when k\ log 2 k n /n tends to zero. The method of proof is close in spirit though 
slightly more intricate than Proposition 18.



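Proposition 22 Let $T_n = \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i \langle \Gamma_n^{\dagger} X_i , X_{n+1} \rangle$, then

\[ E \| T_n \|^2 = \frac{\sigma_\varepsilon^2}{n} k + \frac{\sigma_\varepsilon^2}{n} \mathrm{tr} \big[ \Gamma E ( \Gamma_n^{\dagger} - \Gamma^{\dagger} ) \big] . \]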



Remark 23 We see that the right hand side in the display above matches the decomposition in 



and tr 



ri ( ri - rt 



jn is precisely B n in Theorem [5j 



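Proof :

We have

\[ \| T_n \|^2 = \frac{1}{n^2} \sum_{i=1}^{n} \| \varepsilon_i \|^2 \langle \Gamma_n^{\dagger} X_i , X_{n+1} \rangle^2 + \frac{1}{n^2} \sum_{i \neq i'} \langle \varepsilon_i , \varepsilon_{i'} \rangle \langle \Gamma_n^{\dagger} X_i , X_{n+1} \rangle \langle \Gamma_n^{\dagger} X_{i'} , X_{n+1} \rangle . \]

We take expectations in the display above and we note that the distribution of each member of the first series on the right hand side does not depend on n or i and, due to linearity of expectation and $E ( \varepsilon_i | X_i ) = 0$, the expectation of the second series is null, hence

\[ E \| T_n \|^2 = \frac{1}{n} E \big[ \| \varepsilon_1 \|^2 \langle \Gamma_n^{\dagger} X_1 , X_{n+1} \rangle^2 \big] = \frac{1}{n} E \Big\{ E \big[ \| \varepsilon_1 \|^2 \langle \Gamma_n^{\dagger} X_1 , X_{n+1} \rangle^2 \, \big| \, \varepsilon_1 , X_1 , ..., X_n \big] \Big\} = \frac{\sigma_\varepsilon^2}{n} E \langle \Gamma_n^{\dagger} \Gamma \Gamma_n^{\dagger} X_1 , X_1 \rangle , \]

with $\sigma_\varepsilon^2 = E \| \varepsilon_1 \|^2$. We focus on $E \langle \Gamma_n^{\dagger} \Gamma \Gamma_n^{\dagger} X_1 , X_1 \rangle$ and we see that this expectation is nothing but the expectation of the trace of the operator $\Gamma_n^{\dagger} \Gamma \Gamma_n^{\dagger} \cdot (X_1 \otimes X_1)$, hence

\[ E \langle \Gamma_n^{\dagger} \Gamma \Gamma_n^{\dagger} X_1 , X_1 \rangle = E \langle \Gamma_n^{\dagger} \Gamma \Gamma_n^{\dagger} X_i , X_i \rangle = E \big[ \mathrm{tr} \, \Gamma_n^{\dagger} \Gamma \Gamma_n^{\dagger} \cdot (X_i \otimes X_i) \big] \]

and

\[ E \langle \Gamma_n^{\dagger} \Gamma \Gamma_n^{\dagger} X_1 , X_1 \rangle = E \, \mathrm{tr} \Big[ \Gamma_n^{\dagger} \Gamma \Gamma_n^{\dagger} \cdot \frac{1}{n} \sum_{i=1}^{n} (X_i \otimes X_i) \Big] = E \, \mathrm{tr} \big[ \Gamma_n^{\dagger} \Gamma \Gamma_n^{\dagger} \Gamma_n \big] = E \, \mathrm{tr} \big[ \Gamma_n^{\dagger} \Gamma \widehat{\Pi}_k \big] = E \, \mathrm{tr} \big[ \widehat{\Pi}_k \Gamma_n^{\dagger} \Gamma \big] = \mathrm{tr} \big[ \Gamma E \Gamma_n^{\dagger} \big] . \]

At last we get :

\[ E \langle \Gamma_n^{\dagger} \Gamma \Gamma_n^{\dagger} X_1 , X_1 \rangle = \mathrm{tr} \big[ \Gamma \Gamma^{\dagger} \big] + \mathrm{tr} \big[ \Gamma E ( \Gamma_n^{\dagger} - \Gamma^{\dagger} ) \big] = k + \mathrm{tr} \big[ \Gamma E ( \Gamma_n^{\dagger} - \Gamma^{\dagger} ) \big] . \]

From Lemma 24 just below, we deduce that $\mathrm{tr} \big[ \Gamma E ( \Gamma_n^{\dagger} - \Gamma^{\dagger} ) \big] = o (k)$, which finishes the proof of Proposition 22.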
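
The chain of trace identities in this proof holds sample by sample, before taking expectations, and can be verified numerically. The sketch below assumes a hypothetical diagonal covariance with eigenvalues $j^{-2}$ in dimension 10 and Gaussian data; the variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, k = 300, 10, 4
lam = np.arange(1, p + 1) ** -2.0
gamma = np.diag(lam)                                   # population covariance
X = rng.standard_normal((n, p)) * np.sqrt(lam)
gamma_n = X.T @ X / n                                  # empirical covariance

w, v = np.linalg.eigh(gamma_n)
top = np.argsort(w)[::-1][:k]
gd_n = (v[:, top] / w[top]) @ v[:, top].T              # Gamma_n^dagger
gd = np.diag(np.r_[1 / lam[:k], np.zeros(p - k)])      # Gamma^dagger

lhs = np.trace(gd_n @ gamma @ gd_n @ gamma_n)          # tr[G_n^+ G G_n^+ G_n]
mid = np.trace(gamma @ gd_n)                           # tr[G G_n^+]
split = np.trace(gamma @ gd) + np.trace(gamma @ (gd_n - gd))
```

The first term of `split`, `np.trace(gamma @ gd)`, equals k, so the only random contribution is the trace of $\Gamma ( \Gamma_n^{\dagger} - \Gamma^{\dagger} )$.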
Lemma 24 We have $\mathrm{tr} \big[ \Gamma E ( \Gamma_n^{\dagger} - \Gamma^{\dagger} ) \big] \leq C k^2 (\log k) / n$, where C does not depend on S, n or k. The preceding bound is an o(k) since $k (\log k) / n \to 0$.
Proof : We focus on 

(rt - rt) = - I -(zi- Tn)- 1 (r n - r) (zi - ry 1 dz 
= - [ -(zi-rr^Tn-^izi-ryUz 
- f \{zi- r n y l (r n - r) (zi - vy 1 (r n - r) (zi - ry 1 dz. 

J C n 

But e j Cn i (zi - ry 1 (r n - r) (zi - ry 1 dz = j Cn \ (zi - ry 1 e (r n - r) (zi - ry 1 dz 

so we consider the second term above 



R r , 



f ~{zi- Tn)- 1 (r„ - r) (zi - ry 1 (r n - r) (zi - vy 1 dz 

J Cn 



1 



- (zi - T)- l/2 T n (z) A n (z) A n (z) (zi - ry 1 '' dz, 



-1/2. 



where 



T n (z) = (zi - T) 1 ' 2 (zi - Tn)- 1 (zi - T) 1 ' 2 , A n (z) = (zi - T)~ 1 / 2 (T n - V) (zi - Ty 1 ' 2 
whence 

A, 



+oo „ 

ti[TR n ] = Y, / 

„" — i J C r 



+ O0 



3=1 

A 



z — A 



(T n (z) A n (z) A n (z) (ej) , (ej)) dz 



V 3 —(T n (z)A n {z)A n (z)(e j ),(e j ))dz= / tr 



3=1 



(zi - ry 1 TT n (z) A n (z) A n (z) 
dz. Indeed, if we denote 



dz, 



and|tr[ri? n ]|</ Cn [ {zi — Ty 1 TT n (z) J\A n (z)\\ 2 C2 

tr [{zi - T) TT n (z) A n (z) A n (z)] = tr \A n (z) T n (z) A n (z 
with f n (z) = ri/ 2 (zi - Tn)' 1 T 1 / 2 symmetric, we obtain 



tr 



An (z) T n (z) A n (z)} = T}J 2 (z) A n (z) < T}J 2 (z) \\A n (z 



C-2 



\c 2 ■ 



Now let us fix m. We have 



T n 1/2 (z) 

first inequality comes from the fact that T n (z) is symmetric, hence T n (z) 
The last one comes from : 



< 



T n (z) and sup zeBji 



T n (z) 



< Cm a.s. The 



SU P||u||<l 



T n (z) u, 



T n (z) = r 1 / 2 (zi - vy 1 ' 2 (zi - r) 1/2 (zi - r n r x (zi - r) 1/2 (zi - ry 1/2 r 1 / 2 , 



and 



T n (z) < {zi-Tf 2 {zi-T n yyzi-T) L '- A (zi-ry'r 



-1 



,1/2 



,-1 



These facts prove (fl3|) . Now, by Lemma [T71 we can write E \\A n (z)^ < C (j log j) 2 /n,and 
consequently E|tr[ri?„]| < C^^M = C \ £* =1 ~ A J+ i) j 3 (log jf . By an Abel 
transform, we get :
k 



E (A, - A i+1 ) f (log 3? < ^±ifc 3 (log kf + lf^ X 3 f (log 



fc 2 (logfc) lA fc 2 (logfc) 

< + - VJ (logj) < : 

n n n 

i=i 



which yields $E | \mathrm{tr} [ \Gamma R_n ] | \leq C k^2 (\log k) / n$, where C is a universal constant. Finally $\mathrm{tr} \big[ \Gamma E ( \Gamma_n^{\dagger} - \Gamma^{\dagger} ) \big] / k \leq C k (\log k) / n \to 0$ and we proved Lemma 24. Now we are ready to turn to Theorem 3.



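Proof of Theorem 3 :

From equation (12), we obtain

\[ E \| \widehat{S}_n (X_{n+1}) - S (X_{n+1}) \|^2 = E \| S \widehat{\Pi}_k (X_{n+1}) - S (X_{n+1}) \|^2 + E \Big\| \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i \langle \Gamma_n^{\dagger} X_i , X_{n+1} \rangle \Big\|^2 . \]

From Proposition 22 followed by Lemma 24 the second term is $\sigma_\varepsilon^2 k / n + B_n$. It follows from Proposition 20 and basic calculations that :

\[ E \| S \widehat{\Pi}_k (X_{n+1}) - S (X_{n+1}) \|^2 = E \| S ( \Pi_k - I ) (X_{n+1}) \|^2 + A_n , \]

where $A_n$ matches the bound of the Theorem. At last $E \| S ( \Pi_k - I ) (X_{n+1}) \|^2 = \sum_{j \geq k+1} \lambda_j \| S e_j \|^2$, which finishes the proof.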
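
To see how the two leading terms of the risk trade off, one can minimise $\sigma_\varepsilon^2 k / n + \sum_{j > k} \lambda_j \| S e_j \|^2$ over k numerically. The sketch below ignores the remainders $A_n$ and $B_n$ and uses hypothetical sequences $\lambda_j = j^{-2}$ and $\| S e_j \|^2 = j^{-2}$; the function names are ours.

```python
import numpy as np

def risk_leading_terms(k, n, lam, s_norm2, sigma2):
    """sigma2 * k / n + sum over j > k of lam_j * ||S e_j||^2 (remainders ignored)."""
    return sigma2 * k / n + np.sum(lam[k:] * s_norm2[k:])

def oracle_dimension(n, lam, s_norm2, sigma2):
    """Return the minimiser over k = 1..len(lam)-1 and the vector of risks."""
    ks = np.arange(1, len(lam))
    risks = np.array([risk_leading_terms(k, n, lam, s_norm2, sigma2) for k in ks])
    return int(ks[np.argmin(risks)]), risks

J = 2000                                    # truncation level of the sequences
lam = np.arange(1, J + 1) ** -2.0
s_norm2 = np.arange(1, J + 1) ** -2.0
k_small, _ = oracle_dimension(1_000, lam, s_norm2, sigma2=1.0)
k_large, _ = oracle_dimension(100_000, lam, s_norm2, sigma2=1.0)
```

In this toy setting, adding the (k+1)-th component pays off as long as $(k+1)^{-4} > 1/n$, so the minimiser grows like $n^{1/4}$: it is 5 for n = 1000 and 17 for n = 100000.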

Proof of Theorem [7] : 

Our proof follows the lines of Cardot, Johannes (2010) through a modified version of Assouad's lemma.

To simplify notations we set $k^* = k_n^*$. Take $S^\theta = \sum_{i=1}^{k^*} \eta_i \omega_i \, e_i \otimes e_1$ where $\omega_i \in \{ -1, 1 \}$ and $\theta = [ \omega_1 , ..., \omega_{k^*} ]$ and $\eta_i \in \mathbb{R}_+$ will be fixed later such that $S^\theta \in \mathcal{L}_2 ( \varphi , C )$ for all $\theta$. Denote $\theta_{-i} = [ \omega_1 , ..., - \omega_i , ..., \omega_{k^*} ]$ and let $P_\theta := P_\theta [ (Y_1 , X_1) , ..., (Y_n , X_n) ]$ denote the distribution of the data when $S = S^\theta$. Let $\rho$ stand for Hellinger's affinity, $\rho ( P_0 , P_1 ) = \int \sqrt{ dP_0 \, dP_1 }$, and $KL ( P_0 , P_1 )$ for the Kullback-Leibler divergence; then $\rho ( P_0 , P_1 ) \geq 1 - \frac{1}{2} KL ( P_0 , P_1 )$.

Note that considering models based on $S^\theta$ above comes down to projecting the model on a one-dimensional space. We are then faced with a linear model with real output and finally confine ourselves to proving that the optimal rate is unchanged (see Hall, Horowitz (2007)).
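
The inequality between Hellinger's affinity and the Kullback-Leibler divergence can be illustrated in the simplest case of two real Gaussian laws with the same variance and shifted means, for which both quantities have closed forms. This is only a toy check under that assumption; the function names are ours.

```python
import numpy as np

def kl_gaussian_shift(mu, sigma):
    """KL(N(0, sigma^2), N(mu, sigma^2)) = mu^2 / (2 sigma^2)."""
    return mu ** 2 / (2 * sigma ** 2)

def affinity_gaussian_shift(mu, sigma):
    """Hellinger affinity of N(0, sigma^2) and N(mu, sigma^2): exp(-mu^2 / (8 sigma^2))."""
    return np.exp(-mu ** 2 / (8 * sigma ** 2))

def affinity_numeric(mu, sigma, half_width=12.0, n_points=200_001):
    """Riemann-sum approximation of the integral of sqrt(p0 * p1) (mu >= 0 assumed)."""
    x = np.linspace(-half_width * sigma, half_width * sigma + mu, n_points)
    dx = x[1] - x[0]
    p0 = np.exp(-x ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
    p1 = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
    return float(np.sum(np.sqrt(p0 * p1)) * dx)

shifts = np.linspace(0.0, 3.0, 31)
# affinity minus its lower bound 1 - KL / 2; should be non-negative everywhere
gap = affinity_gaussian_shift(shifts, 1.0) - (1 - 0.5 * kl_gaussian_shift(shifts, 1.0))
```

In particular, a divergence bounded by 1 forces the affinity to be at least 1/2, which is the form in which the inequality is used in lower-bound arguments of this type.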



T^-n (T n 



sup E 

sec 2 (<p,c) 



2 1 ^ n 

(T n -S)T 1 / 2 \\ 2 > ¥ £ Y. X ^((T n -S e )e i ,e l 



»e{-i,i} 



* i=l 



> 



25 E 2^ Al E e((r n -5 e ) ei ,ei) +E e _ i {(T n -5 9 -.) e! ,e 1 f 

W6{-1,1}* i=1 

^ E E^(p»,pO 



u)e{-i,i} k i=1 

The last line was obtained by a slight variant of the bound (A. 9) in Cardot, Johannes (2010), 
p. 405 detailed below : 

f ((T n - S e ) ei ,ei) , f ((T n - S e -*)ei, ei ) , 



1/2 






by Cauchy-Schwartz inequality and since \ ((S e ~ i — S e ) ej,ei) = 2rji. Then 
yields : 



K n (T n ) > inf inf p (Fg, ¥g_.) V X lV . 



W6{-1,1} 



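We show below that $KL ( P_\theta , P_{\theta_{-i}} ) \leq 4 n \lambda_i \eta_i^2 / \sigma_1^2$. Choosing $\eta_i = \sigma_1 / ( 2 \sqrt{ n \lambda_i } )$ for $1 \leq i \leq k_n$ gives $S^\theta \in \mathcal{L}_2 ( \varphi , 1 )$ and $\sup_{\omega , i} KL ( P_\theta , P_{\theta_{-i}} ) \leq 1$, $\inf_{\omega , i} \rho ( P_\theta , P_{\theta_{-i}} ) \geq 1/2$ and

whatever the choice of the estimate $T_n$. This proves the lower bound :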



lim sup (fn 1 inf sup E {T n — S)T l l 2 



2 1 

>2' 



and the Theorem stems from this last display. 

We finish by proving that $KL ( P_\theta , P_{\theta_{-i}} ) \leq 4 n \lambda_i \eta_i^2 / \sigma_1^2$. It suffices to notice that

\[ KL ( P_\theta , P_{\theta_{-i}} ) = \int \log \big( dP_{\theta | X} / dP_{\theta_{-i} | X} \big) \, dP_\theta , \]

where $P_{\theta | X}$ stands for the likelihood of Y conditionally to X. In this Hilbert setting we must clarify the existence of this likelihood ratio. It suffices to prove that $P_{\theta | X} (Y) \ll P_{0 | X} (Y)$, which in turn is true when $S^\theta X$ belongs to the RKHS associated to $\varepsilon$ (see Lifshits (1995)). In other words we need that almost surely $\Gamma_\varepsilon^{-1/2} S^\theta X$ is finite, where $\Gamma_\varepsilon$ is the covariance operator of the noise. But $\Gamma_\varepsilon^{-1/2} S^\theta = S^\theta / \sigma_1$. Set $\omega'_l = \omega_l$ if $l \neq i$ with $\omega'_i = - \omega_i$ :



2 



log w e p(Y) = -|( y ' e i)-E^^ e ')) +[(Y,e 1 )-J2"lVl(X, 



= -2ojij]i X, ^ i ( 2(e,ei) + ^ oj t r]i (X,ei) - ^w(r# (X,ei) ) 

a i V 1=1 1=1 J 

= -2u i r H ^^-(2(e,ei)+2oj iVi {X,e i )) 
°l 

and E e [log dP e , x (Y) /dP e _ ilx (Y)] = 4r/?E e (X, e,) 2 /a 2 = 4t£V*i 
Now we focus on the problem of weak convergence. 
Proof of Theorem 11 :

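Consider (11). We claim that weak convergence of $\widehat{S}_n$ will depend on the series $(1/n) \sum_{i=1}^{n} \varepsilon_i \otimes \Gamma_n^{\dagger} X_i$. This fact can be checked by inspecting the proof of Theorem 3. We are going to prove that $(1/n) \sum_{i=1}^{n} \varepsilon_i \otimes \Gamma_n^{\dagger} X_i$ cannot converge for the classical (supremum) operator norm. We replace the random $\Gamma_n^{\dagger}$ by the non-random $\Gamma^{\dagger}$. It is plain that non-convergence of the second series implies non-convergence of the first. Suppose that for some sequence $\alpha_n \uparrow +\infty$ the centered series $( \alpha_n / n ) \sum_{i=1}^{n} \varepsilon_i \otimes \Gamma^{\dagger} X_i \xrightarrow{w} Z$ in operator norm, where Z is a fixed random operator (not necessarily gaussian). Then for all fixed x and y in H, $( \alpha_n / n ) \sum_{i=1}^{n} \langle \varepsilon_i , y \rangle \langle \Gamma^{\dagger} X_i , x \rangle \xrightarrow{w} \langle Z x , y \rangle$, as real random variables. First take x in the domain of $\Gamma^{-1/2}$. From $\| \Gamma^{-1/2} x \| < +\infty$, we see that $E \langle \varepsilon_i , y \rangle^2 \langle \Gamma^{\dagger} X_i , x \rangle^2 < +\infty$ implies that $\alpha_n = \sqrt{n}$ (and Z is gaussian since we apply the central limit theorem for independent random variables). Now take an x such that $\| \Gamma^{-1/2} x \| = +\infty$, then $E \langle \varepsilon_i , y \rangle^2 \langle \Gamma^{\dagger} X_i , x \rangle^2 = E \langle \varepsilon_i , y \rangle^2 \, E \langle \Gamma^{\dagger} x , x \rangle$, and it is easily seen from the definition of $\Gamma^{\dagger}$ that $\langle \Gamma^{\dagger} x , x \rangle$, which is positive and implicitly depends on n through k, tends to infinity. Consequently $( 1 / \sqrt{n} ) \sum_{i=1}^{n} \varepsilon_i \otimes \Gamma^{\dagger} X_i$ cannot converge weakly anymore since the margins related to the x's do not converge in distribution. This proves the Theorem.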



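The two next Lemmas prepare the proof of Theorem 12. We set $T_n = \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i \langle \Gamma_n^{\dagger} X_i , X_{n+1} \rangle$ and this series is the crucial term that determines weak convergence. We go quickly through the first Lemma since it is close to Lemma 8 p. 355 in Cardot, Mas, Sarda (2007).

Lemma 25 Fix x in H, then $\sqrt{ n / k_n } \, \langle T_n , x \rangle \xrightarrow{w} \mathcal{N} ( 0 , \sigma_{\varepsilon , x}^2 )$, where $\sigma_{\varepsilon , x}^2 = E \langle \varepsilon_1 , x \rangle^2$.

Proof : Let $\mathcal{F}_n$ be the $\sigma$-algebra generated by $( \varepsilon_1 , ..., \varepsilon_n , X_1 , ..., X_n )$. We see that $Z_i^n = \langle \varepsilon_i , x \rangle \langle \Gamma_n^{\dagger} X_i , X_{n+1} \rangle$ is a real-valued martingale difference, besides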

2 



E 



Zf 



n+1 



Applying Lemma 24 and results by McLeish (1974) on weak convergence for martingale difference arrays yields the Lemma.



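Lemma 26 The random sequence $\sqrt{ n / k_n } \, T_n$ is flatly concentrated and uniformly tight. In fact, if $\mathcal{P}_m$ is the projection operator on the first m eigenvectors of $\Gamma_\varepsilon$ and $\eta > 0$ is a real number,

\[ \limsup_{m \to +\infty} \sup_n P \Big( \Big\| \sqrt{ n / k_n } \, ( I - \mathcal{P}_m ) T_n \Big\| > \eta \Big) = 0 . \]

Proof : Let $\mathcal{P}_m$ be the projection operator on the first m eigenvectors of $\Gamma_\varepsilon$. For $\sqrt{ n / k_n } \, T_n$ to be flatly concentrated it is sufficient to prove that for any $\eta > 0$,

\[ \limsup_{m \to +\infty} \sup_n P \Big( \Big\| \sqrt{ n / k_n } \, ( I - \mathcal{P}_m ) T_n \Big\| > \eta \Big) = 0 . \]

We have

\[ P \Big( \Big\| \sqrt{ n / k_n } \, ( I - \mathcal{P}_m ) T_n \Big\| > \eta \Big) \leq \frac{n}{\eta^2 k_n} E \| ( I - \mathcal{P}_m ) T_n \|^2 = \frac{1}{\eta^2 k_n} E \langle \Gamma_n^{\dagger} X_1 , X_{n+1} \rangle^2 \, E \| ( I - \mathcal{P}_m ) \varepsilon_1 \|^2 . \]

We see first that $\sup_n P \big( \| \sqrt{ n / k_n } \, ( I - \mathcal{P}_m ) T_n \| > \eta \big) \leq C \, E \| ( I - \mathcal{P}_m ) \varepsilon_1 \|^2$ where C is some constant, once again following Lemma 24. Now it is plain that

\[ \limsup_{m \to +\infty} E \| ( I - \mathcal{P}_m ) \varepsilon_1 \|^2 = 0 , \]

because $\mathcal{P}_m$ was precisely chosen to be the projector on the first m eigenvectors of the trace-class operator $\Gamma_\varepsilon$. In fact $E \| ( I - \mathcal{P}_m ) \varepsilon_1 \|^2 = \mathrm{tr} [ ( I - \mathcal{P}_m ) \Gamma_\varepsilon ( I - \mathcal{P}_m ) ]$, and this trace is nothing but the series summing the eigenvalues of $\Gamma_\varepsilon$ from order m + 1 to infinity, hence the result.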
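
The identity between the residual energy of the noise after projection and the tail of the eigenvalue series can be checked on a finite-dimensional example. The sketch assumes a hypothetical 50 x 50 noise covariance with eigenvalues $j^{-2}$ and a random orthonormal eigenbasis; `tail_energy` is our name for the trace in question.

```python
import numpy as np

lam_eps = np.arange(1, 51) ** -2.0                   # hypothetical noise eigenvalues
rng = np.random.default_rng(4)
q, _ = np.linalg.qr(rng.standard_normal((50, 50)))   # orthonormal eigenvectors
gamma_eps = (q * lam_eps) @ q.T                      # noise covariance operator

def tail_energy(m):
    """tr[(I - P_m) Gamma_eps (I - P_m)] with P_m the projector on the first m eigenvectors."""
    proj = q[:, :m] @ q[:, :m].T
    comp = np.eye(50) - proj
    return np.trace(comp @ gamma_eps @ comp)
```

As m grows, `tail_energy(m)` decreases to zero, which is the flat concentration property in this toy setting.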



Proof of Theorem 12 : We only prove the second part of the theorem : weak convergence with no bias. The first part follows immediately. We start again from the decomposition (12). As announced just above, the two first terms vanish with respect to convergence in distribution. For $S ( \widehat{\Pi}_k - \Pi_k ) (X_{n+1})$, we invoke Proposition 20 to claim that, whenever $k^2 \log^2 k / n \to 0$, then $(n/k) \, E \| S ( \widehat{\Pi}_k - \Pi_k ) (X_{n+1}) \|^2 \to 0$ and we just have to deal with the first term, related to bias : $S ( \Pi_k - I ) (X_{n+1})$.

Assume first that the mean square of the latter remainder, $(n/k) \sum_{j \geq k+1} \lambda_j \| S (e_j) \|^2$, decays to zero. Then the proof of the Theorem is immediate from Lemmas 25 and 26. The sequence $\sqrt{ n / k_n } \, T_n$ is uniformly tight and its finite dimensional distributions (in the sense of "all finite-dimensional projections of $\sqrt{ n / k_n } \, T_n$") converge weakly to $\mathcal{N} ( 0 , \sigma_{\varepsilon , x}^2 )$. This is enough to claim that Theorem 12 holds. We refer for instance to de Acosta (1970) or Araujo and Giné (1980) for checking the validity of this conclusion.

Finally, the only fact to be proved is $\lim_{n \to +\infty} (n/k) \sum_{j \geq k+1} \lambda_j \| S (e_j) \|^2 = 0$ when tightening conditions on the sequence $k_n$. This looks like an Abelian theorem which could be proved by special techniques but we prove it in a simple direct way. First, we know by previous remarks (since $\lambda_j$ and $\| S (e_j) \|^2$ are convergent series) that $\lambda_j \| S (e_j) \|^2 = \tau_j / ( j^2 \log^2 j )$, where $\tau_j$ tends to zero. Taking as in the first part of the theorem $n = k^2 \log^2 k / \sqrt{ \gamma_k }$, we can focus on $\lim_{k \to +\infty} \frac{ k \log^2 k }{ \sqrt{ \gamma_k } } \sum_{j \geq k+1} \tau_j / ( j^2 \log^2 j )$. We know that for a sufficiently large k and for all $j \geq k$,

$0 \leq \tau_j \leq \epsilon$ where $\epsilon > 0$ is fixed. Then



- E 



x k log 2 k 



J o2 



j=k+i 



< 



< 



1 +oo km+k , , 2 r 
1 x - v - KlOg k 



m=l j=km+\ 
+oo / 



SUp Tj 



k 2 log 2 k 



, , 

/ 7fc,~ L \km+i<j<km J J k 2 m 2 log 2 km 
if \ + °° 1 _ 



which removes the bias term and is the desired result.